Re: spark sql - reading data from sql tables having space in column names

2015-06-07 Thread Cheng Lian

You can use backticks to quote the column names.
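A minimal sketch (the table and column names below are hypothetical):

    // Backticks quote an identifier that contains a space.
    val df = sqlContext.sql("SELECT `Executor Info` FROM logs")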

Cheng

On 6/3/15 2:49 AM, David Mitchell wrote:


I am having the same problem reading JSON. There does not seem to be 
a way of selecting a field that has a space, such as "Executor Info" from the 
Spark logs.


I suggest that we open a JIRA ticket to address this issue.

On Jun 2, 2015 10:08 AM, ayan guha guha.a...@gmail.com wrote:


I would think the easiest way would be to create a view in DB with
column names with no space.

In fact, you can pass a sql in place of a real table.

From the documentation: "The JDBC table that should be read. Note that
anything that is valid in a `FROM` clause of a SQL query can be
used. For example, instead of a full table you could also use a
subquery in parentheses."
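A minimal sketch of that approach with the Spark 1.3 JDBC data source (the URL,
table, and column names below are hypothetical):

    // Pass a parenthesized subquery, with the spaced column aliased,
    // as the "dbtable" option; anything valid in a FROM clause works.
    val df = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:sqlserver://dbhost:1433;databaseName=legacy",
      "dbtable" -> "(SELECT [Order Id] AS order_id FROM dbo.Orders) AS t"))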

Kindly let the community know if this works

On Tue, Jun 2, 2015 at 6:43 PM, Sachin Goyal
sachin.go...@jabong.com wrote:

Hi,

We are using spark sql (1.3.1) to load data from Microsoft sql
server using jdbc (as described in

https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases).


It is working fine except when there is a space in column
names (we can't modify the schemas to remove space as it is a
legacy database).

Sqoop is able to handle such scenarios by enclosing column
names in '[ ]' - the method recommended for Microsoft SQL
Server.

(https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/manager/SQLServerManager.java
- line no 319)

Is there a way to handle this in spark sql?

Thanks,
sachin




-- 
Best Regards,

Ayan Guha





Re: Problem reading Parquet from 1.2 to 1.3

2015-06-07 Thread Don Drake
Thanks Cheng,  we have a workaround in place for Spark 1.3 (remove
.metadata directory), good to know it will be resolved in 1.4.

-Don

On Sun, Jun 7, 2015 at 8:51 AM, Cheng Lian lian.cs@gmail.com wrote:

  This issue has been fixed recently in Spark 1.4
 https://github.com/apache/spark/pull/6581

 Cheng


 On 6/5/15 12:38 AM, Marcelo Vanzin wrote:

 I talked to Don outside the list and he says that he's seeing this issue
 with Apache Spark 1.3 too (not just CDH Spark), so it seems like there is a
 real issue here.

 On Wed, Jun 3, 2015 at 1:39 PM, Don Drake dondr...@gmail.com wrote:

 As part of upgrading a cluster from CDH 5.3.x to CDH 5.4.x I noticed that
 Spark is behaving differently when reading Parquet directories that contain
 a .metadata directory.

  It seems that in spark 1.2.x, it would just ignore the .metadata
 directory, but now that I'm using Spark 1.3, reading these files causes the
 following exceptions:

 scala> val d = sqlContext.parquetFile("/user/ddrak/parq_dir")

 SLF4J: Failed to load class org.slf4j.impl.StaticLoggerBinder.

 SLF4J: Defaulting to no-operation (NOP) logger implementation

 SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for
 further details.

 scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown
 during a parallel computation: java.lang.RuntimeException:
 hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schema.avsc is not a
 Parquet file. expected magic number at tail [80, 65, 82, 49] but found
 [116, 34, 10, 125]

 parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)

 parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)


 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)


 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)

 scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)


 scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)

 scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)

 scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)

 scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)

 scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)

 .

 .

 .



 java.lang.RuntimeException:
 hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schemas/1.avsc is not
 a Parquet file. expected magic number at tail [80, 65, 82, 49] but found
 [116, 34, 10, 125]

 parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)

 parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)


 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)


 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)

 scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)


 scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)

 scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)

 scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)

 scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)

 scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)

 .

 .

 .



 java.lang.RuntimeException:
 hdfs://nameservice1/user/ddrak/parq_dir/.metadata/descriptor.properties
 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but
 found [117, 101, 116, 10]

 parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)

 parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)


 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)


 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)

 scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)


 scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)

 scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)

 scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)

 scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)

 scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)

 .

 .

 .

 at
 scala.collection.parallel.package$$anon$1.alongWith(package.scala:87)

 at
 scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)

 at
 scala.collection.parallel.mutable.ParArray$Map.mergeThrowables(ParArray.scala:650)

 at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)

 at
 scala.collection.parallel.mutable.ParArray$Map.tryMerge(ParArray.scala:650)

 at
 scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)

 at
 

Re: Not understanding manually building EC2 cluster

2015-06-07 Thread Akhil

- Remove localhost from the conf/slaves file and add the slaves' private IPs.
- Make sure the master and slave machines are in the same security group (this
way all ports will be accessible to all machines).
- In the conf/spark-env.sh file, add export
SPARK_MASTER_IP=<master-node-public-or-private-IP> and remove SPARK_LOCAL_IP.

These changes should get you started with the Spark cluster. If not, look in the
log files for more detailed information.


bjameshunter wrote
 Hi,
 
 I've tried a half a dozen times to build a spark cluster on EC2, without
 using the ec2 scripts or EMR. I'd like to eventually get an IPython
 notebook server running on the master, and the ec2 scripts and EMR don't
 seem accommodating for that. 
 
 I build an Ubuntu-spark-ipython machine. 
 I set up the IPython server. IPython and Spark work together. 
 I make an image of the machine, spin up two of them. I can ssh between the
 original (master) and two new slaves without password (put master
 id_rsa.pub on slaves). 
 I add slave public IPs to $SPARK_HOME/conf/slaves, underneath localhost
 I execute $SPARK_HOME/sbin/start-all.sh
 Slaves and master start. The master GUI shows only one slave - itself.
 ---
 Here's where, I think the documentation ends for me and I start trying
 random stuff.
 setting SPARK_MASTER_IP to the EC2 public IP on all machines. Setting
 SPARK_LOCAL_IP to 127.0.01.
 Changing hostname on master to the public IP, changing it to my domain
 name prefixed with spark-master and routing it through my digital ocean
 account, etc.
 
 All combinations of the above steps have been tried, and then some.
 
 Any clue what I don't understand here?
 
 Thanks,
 Ben
 
 https://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Not-understanding-manually-building-EC2-cluster-tp23194p23195.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Monitoring Spark Jobs

2015-06-07 Thread Otis Gospodnetić
Hi Sam,

Have a look at Sematext's SPM for your Spark monitoring needs. If the
problem is CPU, IO, Network, etc. as Akhil mentioned, you'll see that in
SPM, too.
As for the number of jobs running, you can see a chart with that at
http://sematext.com/spm/integrations/spark-monitoring.html

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Sun, Jun 7, 2015 at 6:37 AM, SamyaMaiti samya.maiti2...@gmail.com
wrote:

 Hi All,

 I have a Spark SQL application to fetch data from Hive, on top of which I have
 an akka layer to run multiple queries in parallel.

 *Please suggest a mechanism, so as to figure out the number of spark jobs
 running in the cluster at a given instance of time. *

 I need to do the above as, I see the average response time increasing with
 increase in number of requests, in-spite of increasing the number of cores
 in the cluster. I suspect there is a bottleneck somewhere else.

 Regards,
 Sam



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Monitoring-Spark-Jobs-tp23193.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Running SparkSql against Hive tables

2015-06-07 Thread Cheng Lian



On 6/6/15 9:06 AM, James Pirz wrote:
I am pretty new to Spark, and using Spark 1.3.1, I am trying to use 
'Spark SQL' to run some SQL scripts, on the cluster. I realized that 
for a better performance, it is a good idea to use Parquet files. I 
have 2 questions regarding that:


1) If I want to use Spark SQL against *partitioned & bucketed* tables 
with Parquet format in Hive, does the provided spark binary on the 
apache website support that or do I need to build a new spark binary 
with some additional flags ? (I found a note 
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables in 
the documentation about enabling Hive support, but I could not fully 
work out what the correct way of building is, if I need to build at all)
Yes, Hive support is enabled by default now for the binaries on the 
website. However, Spark SQL doesn't support buckets yet.


2) Does running Spark SQL against tables in Hive degrade the 
performance? Is it better to load parquet files directly to 
HDFS, or is having Hive in the picture harmless?
If you're using Parquet, then it should be fine since by default Spark 
SQL uses its own native Parquet support to read Parquet Hive tables.
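For example, a minimal sketch of querying an existing Hive Parquet table from 
Spark (the table name is hypothetical):

    import org.apache.spark.sql.hive.HiveContext

    // Hive tables are visible through HiveContext; Parquet-backed ones are
    // read with Spark SQL's native Parquet support by default.
    val hiveContext = new HiveContext(sc)
    val df = hiveContext.sql("SELECT * FROM web_logs WHERE year = 2015")
    df.show()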


Thnx





Re: hiveContext.sql NullPointerException

2015-06-07 Thread patcharee

Hi,

How can I work with HiveContext on the executor? If only the 
driver can see HiveContext, does it mean I have to collect all datasets 
(very large) to the driver and use HiveContext there? That would be a memory 
overload on the driver and fail.


BR,
Patcharee

On 07. juni 2015 11:51, Cheng Lian wrote:

Hi,

This is expected behavior. HiveContext.sql (and also 
DataFrame.registerTempTable) is only expected to be invoked on driver 
side. However, the closure passed to RDD.foreach is executed on 
executor side, where no viable HiveContext instance exists.


Cheng

On 6/7/15 10:06 AM, patcharee wrote:

Hi,

I try to insert data into a partitioned hive table. The groupByKey is 
to combine the dataset into a partition of the hive table. After the 
groupByKey, I converted the iterable[X] to a DF by X.toList.toDF(). But 
the hiveContext.sql throws a NullPointerException, see below. Any 
suggestions? What could be wrong? Thanks!


val varWHeightFlatRDD = 
varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey()
  .foreach(
    x => {
      val zone = x._1._1
      val z = x._1._2
      val year = x._1._3
      val month = x._1._4
      val df_table_4dim = x._2.toList.toDF()
      df_table_4dim.registerTempTable("table_4Dim")
      hiveContext.sql("INSERT OVERWRITE table 4dim partition " +
        "(zone=" + zone + ",z=" + z + ",year=" + year + ",month=" + month + ") " +
        "select date, hh, x, y, height, u, v, w, ph, phb, t, p, " +
        "pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim")
    })


java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)

at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)

at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org







-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Caching parquet table (with GZIP) on Spark 1.3.1

2015-06-07 Thread Cheng Lian
Is it possible that some Parquet files of this data set have a different 
schema from the others? Especially the ones reported in the exception messages.


One way to confirm this is to use [parquet-tools] [1] to inspect these 
files:


$ parquet-schema path-to-file

Cheng

[1]: https://github.com/apache/parquet-mr/tree/master/parquet-tools

On 5/26/15 3:26 PM, shsh...@tsmc.com wrote:


we tried to cache a table through
hiveCtx = HiveContext(sc)
hiveCtx.cacheTable("table name")
as described in Spark 1.3.1's documentation, and we're on CDH 5.3.0 with Spark
1.3.1 built with Hadoop 2.6.
The following error message occurs if we try to cache a table with parquet
format & GZIP,
though we're not sure if this error message has anything to do with the
table format, since we can execute SQLs on the exact same table;
we just hope to use cacheTable so that it might speed things up a bit since
we're querying this table several times.
Any advice is welcome! Thanks!

15/05/26 15:21:32 WARN scheduler.TaskSetManager: Lost task 227.0 in stage
0.0 (TID 278, f14ecats037): parquet.io.ParquetDecodingException: Can not
read value at 0 in block -1 in file
hdfs://f14ecat/tmp/tchart_0501_final/part-r-1198.parquet
 at parquet.hadoop.InternalParquetRecordReader.nextKeyValue
(InternalParquetRecordReader.java:213)
 at parquet.hadoop.ParquetRecordReader.nextKeyValue
(ParquetRecordReader.java:204)
 at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext
(NewHadoopRDD.scala:143)
 at org.apache.spark.InterruptibleIterator.hasNext
(InterruptibleIterator.scala:39)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon
$1.hasNext(InMemoryColumnarTableScan.scala:153)
 at org.apache.spark.storage.MemoryStore.unrollSafely
(MemoryStore.scala:248)
 at org.apache.spark.CacheManager.putInBlockManager
(CacheManager.scala:172)
 at org.apache.spark.CacheManager.getOrCompute
(CacheManager.scala:79)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
 at org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.MapPartitionsRDD.compute
(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask
(ShuffleMapTask.scala:68)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask
(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at org.apache.spark.executor.Executor$TaskRunner.run
(Executor.scala:203)
 at java.util.concurrent.ThreadPoolExecutor.runWorker
(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run
(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
Caused by: parquet.io.ParquetDecodingException: The requested schema is not
compatible with the file schema. incompatible types: optional binary
dcqv_val (UTF8) != optional double dcqv_val
 at parquet.io.ColumnIOFactory
$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:105)
 at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit
(ColumnIOFactory.java:97)
 at parquet.schema.PrimitiveType.accept(PrimitiveType.java:386)
 at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren
(ColumnIOFactory.java:87)
 at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit
(ColumnIOFactory.java:61)
 at parquet.schema.MessageType.accept(MessageType.java:55)
 at parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:148)
 at parquet.hadoop.InternalParquetRecordReader.checkRead
(InternalParquetRecordReader.java:125)
 at parquet.hadoop.InternalParquetRecordReader.nextKeyValue
(InternalParquetRecordReader.java:193)
 ... 31 more

15/05/26 15:21:32 INFO scheduler.TaskSetManager: Starting task 74.2 in
stage 0.0 (TID 377, f14ecats025, NODE_LOCAL, 2153 bytes)
15/05/26 15:21:32 INFO scheduler.TaskSetManager: Lost task 56.2 in stage
0.0 

Monitoring Spark Jobs

2015-06-07 Thread SamyaMaiti
Hi All,

I have a Spark SQL application to fetch data from Hive, on top of which I have an
akka layer to run multiple queries in parallel.

*Please suggest a mechanism, so as to figure out the number of spark jobs
running in the cluster at a given instance of time. *

I need to do the above as, I see the average response time increasing with
increase in number of requests, in-spite of increasing the number of cores
in the cluster. I suspect there is a bottleneck somewhere else.

Regards,
Sam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Monitoring-Spark-Jobs-tp23193.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Avro or Parquet ?

2015-06-07 Thread Cheng Lian
Usually Parquet can be more efficient because of its columnar nature. 
Say your table has 10 columns but your join query only touches 3 of 
them: Parquet only reads those 3 columns from disk, while Avro must load 
all the data.
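A minimal sketch of that column pruning (the path and column names are 
hypothetical):

    // Only the three selected columns are read from the Parquet files on disk;
    // the remaining columns are skipped thanks to the columnar layout.
    val events = sqlContext.parquetFile("/data/events.parquet")
    events.select("user_id", "item_id", "ts").show()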


Cheng

On 6/5/15 3:00 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:
We currently have data in avro format and we do joins between avro and 
sequence file data.

Will storing these datasets in Parquet make joins any faster ?

The dataset sizes are between 500 and 1000 GB.
--
Deepak




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: hiveContext.sql NullPointerException

2015-06-07 Thread Cheng Lian
Spark SQL supports Hive dynamic partitioning, so one possible workaround 
is to create a Hive table partitioned by zone, z, year, and month 
dynamically, and then insert the whole dataset into it directly.
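A hedged sketch of that workaround, run entirely on the driver (the staging 
table name is made up; the target table and columns follow your earlier snippet):

    // Enable Hive dynamic partitioning, register the full dataset as a temp
    // table, and let Hive route rows to partitions. Partition columns must
    // come last in the SELECT list.
    hiveContext.sql("SET hive.exec.dynamic.partition = true")
    hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
    fullDF.registerTempTable("staging_4dim")
    hiveContext.sql(
      "INSERT OVERWRITE TABLE 4dim PARTITION (zone, z, year, month) " +
      "SELECT date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, " +
      "qgraup, qnice, qnrain, tke_pbl, el_pbl, zone, z, year, month FROM staging_4dim")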


In 1.4, we also provide dynamic partitioning support for non-Hive 
environments, and you can do something like this:


df.write.partitionBy("zone", "z", "year", "month")
  .format("parquet").mode("overwrite").saveAsTable("tbl")


Cheng

On 6/7/15 9:48 PM, patcharee wrote:

Hi,

How can I work with HiveContext on the executor? If only the 
driver can see HiveContext, does it mean I have to collect all 
datasets (very large) to the driver and use HiveContext there? That 
would be a memory overload on the driver and fail.


BR,
Patcharee

On 07. juni 2015 11:51, Cheng Lian wrote:

Hi,

This is expected behavior. HiveContext.sql (and also 
DataFrame.registerTempTable) is only expected to be invoked on driver 
side. However, the closure passed to RDD.foreach is executed on 
executor side, where no viable HiveContext instance exists.


Cheng

On 6/7/15 10:06 AM, patcharee wrote:

Hi,

I try to insert data into a partitioned hive table. The groupByKey 
is to combine the dataset into a partition of the hive table. After the 
groupByKey, I converted the iterable[X] to a DF by X.toList.toDF(). 
But the hiveContext.sql throws a NullPointerException, see below. Any 
suggestions? What could be wrong? Thanks!


val varWHeightFlatRDD = 
varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey()
  .foreach(
    x => {
      val zone = x._1._1
      val z = x._1._2
      val year = x._1._3
      val month = x._1._4
      val df_table_4dim = x._2.toList.toDF()
      df_table_4dim.registerTempTable("table_4Dim")
      hiveContext.sql("INSERT OVERWRITE table 4dim partition " +
        "(zone=" + zone + ",z=" + z + ",year=" + year + ",month=" + month + ") " +
        "select date, hh, x, y, height, u, v, w, ph, phb, t, p, " +
        "pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim")
    })


java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)

at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)

at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org










-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark ML decision list

2015-06-07 Thread Debasish Das
What is a decision list? An in-order traversal (or some other traversal) of a
fitted decision tree?
On Jun 5, 2015 1:21 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote:

 Is there an existing way in SparkML to convert a decision tree to a
 decision list?

 On Thu, Jun 4, 2015 at 10:50 PM, Reza Zadeh r...@databricks.com wrote:

 The closest algorithm to decision lists that we have is decision trees
 https://spark.apache.org/docs/latest/mllib-decision-tree.html

 On Thu, Jun 4, 2015 at 2:14 AM, Sateesh Kavuri sateesh.kav...@gmail.com
 wrote:

 Hi,

 I have used weka machine learning library for generating a model for my
 training set. I have used the PART algorithm (decision lists) from weka.

 Now, I would like to use Spark ML for the PART algorithm on my training set
 and could not seem to find a parallel. Could anyone point out the
 corresponding algorithm, or say whether it's available in Spark ML?

 Thanks,
 Sateesh






Re: Does Apache Spark maintain a columnar structure when creating RDDs from Parquet or ORC files?

2015-06-07 Thread Cheng Lian

For the following code:

val df = sqlContext.parquetFile(path)

`df` remains columnar (actually it just reads from the columnar Parquet 
file on disk). For the following code:


val cdf = df.cache()

`cdf` is also columnar but that's different from Parquet. When a 
DataFrame is cached, Spark SQL turns it into a private in-memory 
columnar format.


So for your last question, the answer is: yes.

Cheng

On 6/3/15 10:58 PM, lonikar wrote:

When spark reads parquet files (sqlContext.parquetFile), it creates a
DataFrame RDD. I would like to know if the resulting DataFrame has columnar
structure (many rows of a column coalesced together in memory) or a row-wise
structure like a plain Spark RDD has. The section Spark SQL and DataFrames
http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
says you need to call sqlContext.cacheTable(tableName) or df.cache() to
make it columnar. What exactly is this columnar structure?

To be precise: What does the row represent in the expression
df.cache().map{row => ...}?

Is it a logical row which maintains an array of columns and each column in
turn is an array of values for batchSize rows?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-Apache-Spark-maintain-a-columnar-structure-when-creating-RDDs-from-Parquet-or-ORC-files-tp23139.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-07 Thread Cheng Lian

Were you using HiveContext.setConf()?

dfs.replication is a Hadoop configuration, but setConf() is only used 
to set Spark SQL-specific configurations. You can set it in your 
Hadoop core-site.xml instead.
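Another option that may work (a hedged sketch; whether it takes effect depends 
on how the files are written) is to set the Hadoop-level property on the 
SparkContext before saving:

    // dfs.replication is a Hadoop setting, so put it on the Hadoop
    // configuration rather than on Spark SQL's setConf.
    sc.hadoopConfiguration.set("dfs.replication", "2")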


Cheng


On 6/2/15 2:28 PM, Haopu Wang wrote:

Hi,

I'm trying to save SparkSQL DataFrame to a persistent Hive table using
the default parquet data source.

I don't know how to change the replication factor of the generated
parquet files on HDFS.

I tried to set dfs.replication on HiveContext but that didn't work.
Any suggestions are appreciated very much!


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: NullPointerException SQLConf.setConf

2015-06-07 Thread Cheng Lian
Are you calling hiveContext.sql within an RDD.map closure or something 
similar? In that case, the call actually happens on the executor side. 
However, HiveContext only exists on the driver side.
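A minimal sketch of the driver-side pattern being described (the table, column, 
and variable names are hypothetical); the point is only that hiveContext.sql must 
not run inside an RDD closure:

    // Wrong: hiveContext.sql inside rdd.foreach runs on executors and fails.
    // Right: bring the (small) set of partition keys back to the driver and
    // issue the SQL from there.
    val zones = rdd.map(_.zone).distinct().collect()
    zones.foreach { zone =>
      hiveContext.sql(
        s"INSERT OVERWRITE TABLE target PARTITION (zone=$zone) " +
        s"SELECT date, height, u, v FROM staging WHERE zone=$zone")
    }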


Cheng

On 6/4/15 3:45 PM, patcharee wrote:

Hi,

I am using Hive 0.14 and spark 0.13. I got 
java.lang.NullPointerException when inserting into hive. Any 
suggestions please.


hiveContext.sql("INSERT OVERWRITE table 4dim partition (zone=" + ZONE 
+ ",z=" + zz + ",year=" + YEAR + ",month=" + MONTH + ") " +
"select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, " +
"qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim where " +
"z=" + zz)


java.lang.NullPointerException
at org.apache.spark.sql.SQLConf.setConf(SQLConf.scala:196)
at org.apache.spark.sql.SQLContext.setConf(SQLContext.scala:74)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:251)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:250)

at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:95)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3$$anonfun$apply$1.apply(LoadWrfIntoHiveOptReduce1.scala:110)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3$$anonfun$apply$1.apply(LoadWrfIntoHiveOptReduce1.scala:107)

at scala.collection.immutable.Range.foreach(Range.scala:141)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3.apply(LoadWrfIntoHiveOptReduce1.scala:107)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3.apply(LoadWrfIntoHiveOptReduce1.scala:107)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

Best,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Monitoring Spark Jobs

2015-06-07 Thread Akhil Das
It could be a CPU, IO, or network bottleneck; you need to figure out where
exactly it's choking. You can use monitoring utilities (like top)
to understand it better.

Thanks
Best Regards

On Sun, Jun 7, 2015 at 4:07 PM, SamyaMaiti samya.maiti2...@gmail.com
wrote:

 Hi All,

 I have a Spark SQL application to fetch data from Hive, on top of which I have
 an akka layer to run multiple queries in parallel.

 *Please suggest a mechanism, so as to figure out the number of spark jobs
 running in the cluster at a given instance of time. *

 I need to do the above as, I see the average response time increasing with
 increase in number of requests, in-spite of increasing the number of cores
 in the cluster. I suspect there is a bottleneck somewhere else.

 Regards,
 Sam



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Monitoring-Spark-Jobs-tp23193.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Problem reading Parquet from 1.2 to 1.3

2015-06-07 Thread Cheng Lian
This issue has been fixed recently in Spark 1.4 
https://github.com/apache/spark/pull/6581


Cheng

On 6/5/15 12:38 AM, Marcelo Vanzin wrote:
I talked to Don outside the list and he says that he's seeing this 
issue with Apache Spark 1.3 too (not just CDH Spark), so it seems like 
there is a real issue here.


On Wed, Jun 3, 2015 at 1:39 PM, Don Drake dondr...@gmail.com wrote:


As part of upgrading a cluster from CDH 5.3.x to CDH 5.4.x I
noticed that Spark is behaving differently when reading Parquet
directories that contain a .metadata directory.

It seems that in spark 1.2.x, it would just ignore the .metadata
directory, but now that I'm using Spark 1.3, reading these files
causes the following exceptions:

scala> val d = sqlContext.parquetFile("/user/ddrak/parq_dir")

SLF4J: Failed to load class org.slf4j.impl.StaticLoggerBinder.

SLF4J: Defaulting to no-operation (NOP) logger implementation

SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for
further details.

scala.collection.parallel.CompositeThrowable: Multiple exceptions
thrown during a parallel computation: java.lang.RuntimeException:
hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schema.avsc is
not a Parquet file. expected magic number at tail [80, 65, 82, 49]
but found [116, 34, 10, 125]

parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)

parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)


org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)


org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)

scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)


scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)

scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)

scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)

scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)

scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)

.

.

.

java.lang.RuntimeException:
hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schemas/1.avsc
is not a Parquet file. expected magic number at tail [80, 65, 82,
49] but found [116, 34, 10, 125]

parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)

parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)


org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)


org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)

scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)


scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)

scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)

scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)

scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)

scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)

.

.

.

java.lang.RuntimeException:
hdfs://nameservice1/user/ddrak/parq_dir/.metadata/descriptor.properties
is not a Parquet file. expected magic number at tail [80, 65, 82,
49] but found [117, 101, 116, 10]

parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427)

parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398)


org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276)


org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275)

scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)


scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)

scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)

scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)

scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)

scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)

.

.

.

at
scala.collection.parallel.package$$anon$1.alongWith(package.scala:87)

at
scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)

at

scala.collection.parallel.mutable.ParArray$Map.mergeThrowables(ParArray.scala:650)

at
scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)

at
scala.collection.parallel.mutable.ParArray$Map.tryMerge(ParArray.scala:650)

at

scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)

at


Re: columnar structure of RDDs from Parquet or ORC files

2015-06-07 Thread Cheng Lian
Interesting, I just replied to another thread asking exactly the same 
question :) My answer there is quoted below:

 For the following code:

 val df = sqlContext.parquetFile(path)

 `df` remains columnar (actually it just reads from the columnar 
Parquet file on disk). For the following code:


 val cdf = df.cache()

 `cdf` is also columnar but that's different from Parquet. When a 
DataFrame is cached, Spark SQL turns it into a private in-memory 
columnar format.


 So for your last question, the answer is: yes.

Some more details about the in-memory columnar structure: it's columnar, 
but much simpler than the one Parquet uses. The columnar byte arrays are 
split into batches with a fixed row count (configured by  
spark.sql.inMemoryColumnarStorage.batchSize). Also, each column is 
compressed with a compression scheme chosen according to the data type 
and statistics information of that column. Supported compression schemes 
include RLE, DeltaInt, DeltaLong, BooleanBitSet, and DictionaryEncoding.
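For example (a minimal sketch; the batch size value is arbitrary):

    // Tune the in-memory columnar batch size, then cache a DataFrame so it is
    // stored in this compressed columnar format.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
    df.cache()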


You may find the implementation here: 
https://github.com/apache/spark/tree/master/sql/core/src/main/scala/org/apache/spark/sql/columnar


Cheng

On 6/3/15 10:40 PM, kiran lonikar wrote:
When spark reads parquet files (sqlContext.parquetFile), it creates a 
DataFrame RDD. I would like to know if the resulting DataFrame has 
columnar structure (many rows of a column coalesced together in 
memory) or a row-wise structure like a plain Spark RDD has. The section 
Spark SQL and DataFrames 
http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory says 
you need to call sqlContext.cacheTable(tableName) or df.cache() to 
make it columnar. What exactly is this columnar structure?


To be precise: What does the row represent in the expression 
df.cache().map{row => ...}?


Is it a logical row which maintains an array of columns and each 
column in turn is an array of values for batchSize rows?


-Kiran





Re: Spark Streaming Stuck After 10mins Issue...

2015-06-07 Thread Cody Koeninger
What is the code used to set up the kafka stream?

On Sat, Jun 6, 2015 at 3:23 PM, EH eas...@gmail.com wrote:

 And here is the Thread Dump, where seems every worker is waiting for
 Executor
 #6 Thread 95: sparkExecutor-akka.actor.default-dispatcher-22 (RUNNABLE) to
 be complete:

 Thread 41: BLOCK_MANAGER cleanup timer (WAITING)
 Thread 42: BROADCAST_VARS cleanup timer (WAITING)
 Thread 44: shuffle-client-0 (RUNNABLE)
 Thread 45: shuffle-server-0 (RUNNABLE)
 Thread 47: Driver Heartbeater (TIMED_WAITING)
 Thread 48: Executor task launch worker-0 (RUNNABLE)
 Thread 56: threadDeathWatcher-2-1 (TIMED_WAITING)
 Thread 81: sparkExecutor-akka.actor.default-dispatcher-18 (WAITING)
 Thread 95: sparkExecutor-akka.actor.default-dispatcher-22 (RUNNABLE)
 **
 sun.management.ThreadImpl.dumpThreads0(Native Method)
 sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:446)
 org.apache.spark.util.Utils$.getThreadDump(Utils.scala:1777)

 org.apache.spark.executor.ExecutorActor$$anonfun$receiveWithLogging$1.applyOrElse(ExecutorActor.scala:38)

 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

 org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)

 org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
 scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)

 org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
 akka.actor.Actor$class.aroundReceive(Actor.scala:465)

 org.apache.spark.executor.ExecutorActor.aroundReceive(ExecutorActor.scala:34)
 akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 akka.actor.ActorCell.invoke(ActorCell.scala:487)
 akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
 akka.dispatch.Mailbox.run(Mailbox.scala:220)

 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 **
 Thread 112: sparkExecutor-akka.actor.default-dispatcher-25 (WAITING)



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Stuck-After-10mins-Issue-tp23189p23190.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Accumulator map

2015-06-07 Thread Akhil Das
Another approach would be to use ZooKeeper. If you have ZooKeeper
running somewhere in the cluster you can simply create a path like
*/dynamic-list* in it and then write objects/values to it; you can even
create/access nested objects.
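A hedged sketch of that idea using Apache Curator (assuming Curator is on the
classpath and ZooKeeper is reachable at zk-host:2181; the path and payload are
illustrative):

    import org.apache.curator.framework.CuratorFrameworkFactory
    import org.apache.curator.retry.ExponentialBackoffRetry

    // Connect to ZooKeeper and write one value under the shared path;
    // other processes can list and read the children of /dynamic-list.
    val client = CuratorFrameworkFactory.newClient("zk-host:2181",
      new ExponentialBackoffRetry(1000, 3))
    client.start()
    client.create().creatingParentsIfNeeded()
      .forPath("/dynamic-list/segment-A", "42".getBytes("UTF-8"))
    client.close()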

Thanks
Best Regards

On Fri, Jun 5, 2015 at 7:06 PM, Cosmin Cătălin Sanda 
cosmincata...@gmail.com wrote:

 Hi,

 I am trying to gather some statistics from an RDD using accumulators.
 Essentially, I am counting how many times specific segments appear in each
 row of the RDD. This works fine and I get the expected results, but the
 problem is that each time I add a new segment to look for, I have to
 explicitly create an Accumulator for it and explicitly use the Accumulator
 in the foreach method.

 Is there a way to use a dynamic list of Accumulators in Spark? I want to
 control the segments from a single place in the code and the accumulators
 to be dynamically created and used based on the metrics list.

 BR,
 
 *Cosmin Catalin SANDA*
 Software Systems Engineer
 Phone: +45.27.30.60.35




Re: Spark Streaming Stuck After 10mins Issue...

2015-06-07 Thread Akhil Das
Which consumer are you using? If you can paste the complete code then may
be i can try reproducing it.

Thanks
Best Regards

On Sun, Jun 7, 2015 at 1:53 AM, EH eas...@gmail.com wrote:

 And here is the Thread Dump, where seems every worker is waiting for
 Executor
 #6 Thread 95: sparkExecutor-akka.actor.default-dispatcher-22 (RUNNABLE) to
 be complete:

 Thread 41: BLOCK_MANAGER cleanup timer (WAITING)
 Thread 42: BROADCAST_VARS cleanup timer (WAITING)
 Thread 44: shuffle-client-0 (RUNNABLE)
 Thread 45: shuffle-server-0 (RUNNABLE)
 Thread 47: Driver Heartbeater (TIMED_WAITING)
 Thread 48: Executor task launch worker-0 (RUNNABLE)
 Thread 56: threadDeathWatcher-2-1 (TIMED_WAITING)
 Thread 81: sparkExecutor-akka.actor.default-dispatcher-18 (WAITING)
 Thread 95: sparkExecutor-akka.actor.default-dispatcher-22 (RUNNABLE)
 **
 sun.management.ThreadImpl.dumpThreads0(Native Method)
 sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:446)
 org.apache.spark.util.Utils$.getThreadDump(Utils.scala:1777)

 org.apache.spark.executor.ExecutorActor$$anonfun$receiveWithLogging$1.applyOrElse(ExecutorActor.scala:38)

 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

 org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)

 org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
 scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)

 org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
 akka.actor.Actor$class.aroundReceive(Actor.scala:465)

 org.apache.spark.executor.ExecutorActor.aroundReceive(ExecutorActor.scala:34)
 akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 akka.actor.ActorCell.invoke(ActorCell.scala:487)
 akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
 akka.dispatch.Mailbox.run(Mailbox.scala:220)

 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 **
 Thread 112: sparkExecutor-akka.actor.default-dispatcher-25 (WAITING)



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Stuck-After-10mins-Issue-tp23189p23190.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Driver crash at the end with InvocationTargetException when running SparkPi

2015-06-07 Thread Dong Lei
Hi spark users:

After I submitted a SparkPi job to spark, the driver crashed at the end of the 
job with the following log:

WARN EventLoggingListener: Event log dir 
file:/d:/data/SparkWorker/work/driver-20150607200517-0002/logs/event does not 
exists, will newly create one.
Exception in thread main java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:59)
at 
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at 
org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
at 
org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:135)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:401)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:28)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)

From the log, I can see that the driver has added jars from HDFS, connected to 
master, scheduled executors and all the executors were running. And then this 
error occurred.

The command I use to submit the job (I'm running Spark 1.3.1 in standalone mode on 
Windows):
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://localhost:7077 \
  --deploy-mode cluster \
  hdfs://localhost:443/spark-examples-1.3.1-hadoop2.4.0.jar \
  1000


Any ideas about the error?
I've found a similar error in JIRA 
https://issues.apache.org/jira/browse/SPARK-1407, but it only occurred in 
FileLogger when using YARN with the event log set to HDFS. In my case, I use 
standalone mode with the event log set to local, and my error is caused by 
Hadoop.util.Shell.runCommand.


Best Regards
Dong Lei


FlatMap in DataFrame

2015-06-07 Thread dimple
Hi,
I'm trying to write a custom transformer in Spark ML and, since that uses
DataFrames, I am trying to use the flatMap function of the DataFrame class in Java.
Can you share a simple example of how to use the flatMap function to do a word
count on a single column of the DataFrame? Thanks.

Dimple



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/FlatMap-in-DataFrame-tp23199.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: columnar structure of RDDs from Parquet or ORC files

2015-06-07 Thread kiran lonikar
Thanks for replying twice :) I think I sent this question by email and
somehow thought I had not sent it, hence created the other one on the web
interface. Let's retain this thread since you have provided more details
here.

Great, it confirms my intuition about DataFrame. It's similar to Shark's
columnar layout, with the addition of compression. There, java.nio's
ByteBuffer was used to hold the actual data. I will go through the code you
pointed to.
I have another question about DataFrame: The RDD operations are divided into
two groups: *transformations*, which are lazily evaluated and return a new
RDD, and *actions*, which evaluate the lineage defined by the transformations
and return results. What about DataFrame operations like join, groupBy, agg,
unionAll etc., which are all transformations on RDDs? Are they lazily
evaluated or immediately executed?


Examples of flatMap in dataFrame

2015-06-07 Thread Dimp Bhat
Hi,
I'm trying to write a custom transformer in Spark ML and, since that uses
DataFrames, I am trying to use the flatMap function of the DataFrame class in Java.
Can you share a simple example of how to use the flatMap function to do a
word count on a single column of the DataFrame? Thanks


Dimple


Optimization module in Python mllib

2015-06-07 Thread martingoodson
Am I right in thinking that Python mllib does not contain the optimization
module? Are there plans to add this to the Python api?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Optimization-module-in-Python-mllib-tp23191.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: hiveContext.sql NullPointerException

2015-06-07 Thread Cheng Lian

Hi,

This is expected behavior. HiveContext.sql (and also 
DataFrame.registerTempTable) is only expected to be invoked on driver 
side. However, the closure passed to RDD.foreach is executed on executor 
side, where no viable HiveContext instance exists.


Cheng

On 6/7/15 10:06 AM, patcharee wrote:

Hi,

I try to insert data into a partitioned hive table. The groupByKey is 
to combine the dataset into a partition of the hive table. After the 
groupByKey, I converted the iterable[X] to a DF by X.toList.toDF(). But 
the hiveContext.sql throws a NullPointerException, see below. Any 
suggestions? What could be wrong? Thanks!


val varWHeightFlatRDD = 
varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey()
  .foreach(
    x => {
      val zone = x._1._1
      val z = x._1._2
      val year = x._1._3
      val month = x._1._4
      val df_table_4dim = x._2.toList.toDF()
      df_table_4dim.registerTempTable("table_4Dim")
      hiveContext.sql("INSERT OVERWRITE table 4dim partition " +
        "(zone=" + zone + ",z=" + z + ",year=" + year + ",month=" + month + ") " +
        "select date, hh, x, y, height, u, v, w, ph, phb, t, p, " +
        "pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim")
    })


java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)

at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org