Re: Can't access remote Hive table from spark

2015-02-01 Thread guxiaobo1982
A friend told me that I should add the hive-site.xml file to the --files 
option of the spark-submit command, but how can I run and debug my program inside 
Eclipse?
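For reference, the --files approach mentioned above looks roughly like this on the
command line (the class name, master URL, paths and jar below are placeholders taken
from later in this thread):

    spark-submit \
      --class SparkTest \
      --master spark://lix1.bh.com:7077 \
      --files /etc/hive/conf/hive-site.xml \
      spark-test.jar

When running inside an IDE such as Eclipse rather than through spark-submit, the usual
equivalent is to put hive-site.xml on the application's classpath (for example in a
resources directory) so that the HiveContext can pick it up.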






-- Original --
From: "guxiaobo1982"
Send time: Sunday, Feb 1, 2015 4:18 PM
To: "Jörn Franke"

Subject:  Re: Can't access remote Hive table from spark



I am sorry, I forgot to say that I have created the table manually.


On Feb 1, 2015, at 4:14 PM, Jörn Franke wrote:



You commented out the line which is supposed to create the table.
On Jan 25, 2015 09:20, "guxiaobo1982" wrote:
Hi,
I built and started a single-node standalone Spark 1.2.0 cluster along with a 
single-node Hive 0.14.0 instance installed by Ambari 1.17.0. On the Spark and 
Hive node I can create and query tables inside Hive, and from remote machines I 
can submit the SparkPi example to the Spark master. But I failed to run the 
following example code:


 
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.api.java.Row;
import org.apache.spark.sql.hive.api.java.JavaHiveContext;

public class SparkTest {

    public static void main(String[] args) {
        String appName = "This is a test application";
        String master = "spark://lix1.bh.com:7077";

        SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaHiveContext sqlCtx = new org.apache.spark.sql.hive.api.java.JavaHiveContext(sc);
        // sqlCtx.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)");
        // sqlCtx.sql("LOAD DATA LOCAL INPATH '/opt/spark/examples/src/main/resources/kv1.txt' INTO TABLE src");

        // Queries are expressed in HiveQL.
        List<Row> rows = sqlCtx.sql("FROM src SELECT key, value").collect();
        System.out.print("I got " + rows.size() + " rows \r\n");

        sc.close();
    }
}




Exception in thread "main" org.apache.hadoop.hive.ql.metadata.InvalidTableException: Table not found src
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:980)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:70)
    at org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:141)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:141)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:141)
    at org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:253)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:138)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:138)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:137)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.ap

spark-shell can't import the default hive-site.xml options properly.

2015-02-01 Thread guxiaobo1982
Hi,


In order to let a local spark-shell connect to a remote Spark standalone 
cluster and access Hive tables there, I must put the hive-site.xml file into 
the local Spark installation's conf path, but spark-shell can't even import the 
default settings there; I found two errors:
 

 
  <property>
    <name>hive.metastore.client.connect.retry.delay</name>
    <value>5s</value>
  </property>

  <property>
    <name>hive.metastore.client.socket.timeout</name>
    <value>1800s</value>
  </property>
 


Spark-shell tries to read 5s and 1800s as integers; they must be changed to 5 
and 1800 for spark-shell to work. I suggest this be fixed in future versions.
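
For reference, the workaround described above amounts to editing the copy of
hive-site.xml placed under the local Spark conf directory like this (only these two
properties needed changing in this setup; later replies in this thread note that the
"s" suffix itself is legitimate for Hive):

  <property>
    <name>hive.metastore.client.connect.retry.delay</name>
    <value>5</value>
  </property>

  <property>
    <name>hive.metastore.client.socket.timeout</name>
    <value>1800</value>
  </property>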

Re: Spark SQL & Parquet - data are reading very very slow

2015-02-01 Thread Mick Davies
Dictionary encoding of Strings from Parquet has now been added and will be in 1.3.
This should reduce UTF8-to-String decoding significantly.


https://issues.apache.org/jira/browse/SPARK-5309 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Parquet-data-are-reading-very-very-slow-tp21061p21455.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Is pair rdd join more efficient than regular rdd

2015-02-01 Thread Sunita Arvind
Hi All

We are joining large tables using Spark SQL and running into shuffle
issues. We have explored multiple options - using coalesce to reduce the number
of partitions, tuning various parameters like the disk buffer, reducing data in
chunks etc. - which all seem to help, btw. What I would like to know is:
is using a pair RDD instead of a regular RDD one of the solutions? Will it make
the join more efficient, since Spark can shuffle better when it knows the
key? Logically speaking I think it should help, but I haven't found any
evidence on the internet, including the Spark SQL documentation.

It is a lot of effort for us to try this approach and weigh the
performance, as we need to register the output as tables to proceed using
them. Hence I would appreciate inputs from the community before proceeding.


Regards
Sunita Koppar
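
For what it's worth, a minimal sketch of the pair-RDD idea being asked about -
pre-partitioning both sides with the same partitioner so the join itself does not
re-shuffle them - might look like the following (data, names and the partition count
are made up; this is not a claim that it will beat the Spark SQL join):

    import java.util.Arrays;
    import org.apache.spark.HashPartitioner;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class PairJoinSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("pair-join-sketch").setMaster("local[2]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Toy stand-ins for the two large tables.
            JavaPairRDD<String, Integer> left = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<String, Integer>("a", 1), new Tuple2<String, Integer>("b", 2)));
            JavaPairRDD<String, String> right = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<String, String>("a", "x"), new Tuple2<String, String>("b", "y")));

            // Hash-partition both sides on the key up front; a join of two RDDs that
            // share the same partitioner does not need another shuffle at join time.
            HashPartitioner partitioner = new HashPartitioner(8); // partition count is arbitrary here
            JavaPairRDD<String, Integer> leftPart = left.partitionBy(partitioner).cache();
            JavaPairRDD<String, String> rightPart = right.partitionBy(partitioner).cache();

            JavaPairRDD<String, Tuple2<Integer, String>> joined = leftPart.join(rightPart);
            System.out.println(joined.collect());

            sc.close();
        }
    }

The initial partitionBy is itself a shuffle, so this mainly pays off when the
partitioned (and cached) RDDs are joined more than once.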


running 2 spark applications in parallel on yarn

2015-02-01 Thread Tomer Benyamini
Hi all,

I'm running Spark 1.2.0 on a 20-node YARN EMR cluster. I've noticed that
whenever I'm running a heavy computation job in parallel with other running
jobs, I get these kinds of exceptions:

* [task-result-getter-2] INFO  org.apache.spark.scheduler.TaskSetManager-
Lost task 820.0 in stage 175.0 (TID 11327) on executor xxx:
java.io.IOException (Failed to connect to xx:35194) [duplicate 12]

* org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 12

* org.apache.spark.shuffle.FetchFailedException: Failed to connect to
x:35194
Caused by: java.io.IOException: Failed to connect to x:35194

When running the heavy job alone on the cluster, I don't get any
errors. My guess is that Spark contexts from different apps do not share
information about taken ports and therefore collide on specific ports,
causing the job/stage to fail. Is there a way to assign a specific set of
executors to a specific Spark job via "spark-submit", or is there a way to
define a range of ports to be used by the application?

Thanks!
Tomer


how to send JavaDStream RDD using foreachRDD using Java

2015-02-01 Thread sachin Singh
Hi, I want to send streaming data to a Kafka topic.
I have RDD data which I converted into a JavaDStream; now I want to send it
to a Kafka topic. I don't want the Kafka sending code, I just need the foreachRDD
implementation. My code looks like this:
public void publishtoKafka(ITblStream t)
{
    MyTopicProducer MTP = ProducerFactory.createProducer(hostname + ":" + port);
    JavaDStream<String> rdd = (JavaDStream<String>) t.getRDD();

    rdd.foreachRDD(new Function<JavaRDD<String>, Void>() {
        @Override
        public Void call(JavaRDD<String> rdd) throws Exception {
            KafkaUtils.sendDataAsString(MTP, topicName, "String RDDData");
            return null;
        }
    });

    log.debug("sent to kafka: --------------------");
}

here myTopicproducer will create producer which is working fine
KafkaUtils.sendDataAsString is method which will publish data to kafka topic
is also working fine,

I have only one problem I am not able to convert JavaDStream rdd as string
using foreach or foreachRDD finally I need String message from rdds, kindly
suggest java code only and I dont want to use anonymous classes, Please send
me only the part to send JavaDStream RDD using foreachRDD using Function
Call

Thanks in advance,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-send-JavaDStream-RDD-using-foreachRDD-using-Java-tp21456.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
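
A possible sketch of the missing piece - a named Function (no anonymous class) that
turns each micro-batch into one String - is below. MyTopicProducer and
KafkaUtils.sendDataAsString are assumed to be the poster's own helpers from the
snippet above, so the actual send is left as a comment:

    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;

    // Named function (no anonymous class) for use with JavaDStream.foreachRDD.
    public class SendBatchToKafka implements Function<JavaRDD<String>, Void> {
        @Override
        public Void call(JavaRDD<String> rdd) throws Exception {
            // collect() pulls the whole micro-batch to the driver.
            List<String> lines = rdd.collect();
            StringBuilder message = new StringBuilder();
            for (String line : lines) {
                message.append(line).append("\n");
            }
            // Hand the assembled String to the existing helper from the snippet above
            // (MTP and topicName are the poster's own objects, not part of Spark):
            // KafkaUtils.sendDataAsString(MTP, topicName, message.toString());
            return null;
        }
    }

It would be used as rdd.foreachRDD(new SendBatchToKafka()). Note that collect() brings
the whole batch to the driver, so for large batches sending per partition with
foreachPartition is the usual alternative.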

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-shell can't import the default hive-site.xml options properly.

2015-02-01 Thread Denny Lee
I may be missing something here, but typically the hive-site.xml
configurations do not require you to place an "s" within the value
itself.  Both the retry.delay and socket.timeout values are in seconds, so
you should only need to supply the integer value (which is in seconds).

On Sun Feb 01 2015 at 2:28:09 AM guxiaobo1982  wrote:

> Hi,
>
> To order to let a local spark-shell connect to  a remote spark stand-alone
> cluster and access  hive tables there, I must put the hive-site.xml file
> into the local spark installation's conf path, but spark-shell even can't
> import the default settings there, I found two errors:
>
> 
>
>   hive.metastore.client.connect.retry.delay
>
>   5s
>
> 
>
> 
>
>   hive.metastore.client.socket.timeout
>
>   1800s
>
> 
> Spark-shell try to read 5s and 1800s and integers, they must be changed to
> 5 and 1800 to let spark-shell work, It's suggested to be fixed in future
> versions.
>


Re: spark-shell can't import the default hive-site.xml options properly.

2015-02-01 Thread Ted Yu
Looking at common/src/java/org/apache/hadoop/hive/conf/HiveConf.java :


METASTORE_CLIENT_CONNECT_RETRY_DELAY("hive.metastore.client.connect.retry.delay", "1s",
    new TimeValidator(TimeUnit.SECONDS),
    "Number of seconds for the client to wait between consecutive connection attempts"),

It seems having the 's' suffix is legitimate.

On Sun, Feb 1, 2015 at 9:14 AM, Denny Lee  wrote:

> I may be missing something here but typically when the hive-site.xml
> configurations do not require you to place "s" within the configuration
> itself.  Both the retry.delay and socket.timeout values are in seconds so
> you should only need to place the integer value (which are in seconds).
>
>
> On Sun Feb 01 2015 at 2:28:09 AM guxiaobo1982  wrote:
>
>> Hi,
>>
>> To order to let a local spark-shell connect to  a remote spark
>> stand-alone cluster and access  hive tables there, I must put the
>> hive-site.xml file into the local spark installation's conf path, but
>> spark-shell even can't import the default settings there, I found two
>> errors:
>>
>> 
>>
>>   hive.metastore.client.connect.retry.delay
>>
>>   5s
>>
>> 
>>
>> 
>>
>>   hive.metastore.client.socket.timeout
>>
>>   1800s
>>
>> 
>> Spark-shell try to read 5s and 1800s and integers, they must be changed
>> to 5 and 1800 to let spark-shell work, It's suggested to be fixed in future
>> versions.
>>
>


Re: spark-shell can't import the default hive-site.xml options properly.

2015-02-01 Thread Denny Lee
Cool!  For all the times I had been modifying the hive-site.xml I had only
propped in the integer values - learn something new every day, eh?!


On Sun Feb 01 2015 at 9:36:23 AM Ted Yu  wrote:

> Looking at common/src/java/org/apache/hadoop/hive/conf/HiveConf.java :
>
>
> METASTORE_CLIENT_CONNECT_RETRY_DELAY("hive.metastore.client.connect.retry.delay",
> "1s",
> new TimeValidator(TimeUnit.SECONDS),
> "Number of seconds for the client to wait between consecutive
> connection attempts"),
>
> It seems having the 's' suffix is legitimate.
>
> On Sun, Feb 1, 2015 at 9:14 AM, Denny Lee  wrote:
>
>> I may be missing something here but typically when the hive-site.xml
>> configurations do not require you to place "s" within the configuration
>> itself.  Both the retry.delay and socket.timeout values are in seconds so
>> you should only need to place the integer value (which are in seconds).
>>
>>
>> On Sun Feb 01 2015 at 2:28:09 AM guxiaobo1982 
>> wrote:
>>
>>> Hi,
>>>
>>> To order to let a local spark-shell connect to  a remote spark
>>> stand-alone cluster and access  hive tables there, I must put the
>>> hive-site.xml file into the local spark installation's conf path, but
>>> spark-shell even can't import the default settings there, I found two
>>> errors:
>>>
>>> 
>>>
>>>   hive.metastore.client.connect.retry.delay
>>>
>>>   5s
>>>
>>> 
>>>
>>> 
>>>
>>>   hive.metastore.client.socket.timeout
>>>
>>>   1800s
>>>
>>> 
>>> Spark-shell try to read 5s and 1800s and integers, they must be changed
>>> to 5 and 1800 to let spark-shell work, It's suggested to be fixed in future
>>> versions.
>>>
>>
>


Error in saving schemaRDD with Decimal as Parquet

2015-02-01 Thread Manoj Samel
Spark 1.2

SchemaRDD has schema with decimal columns created like

x1 = new StructField("a", DecimalType(14,4), true)

x2 = new StructField("b", DecimalType(14,4), true)

Registering it as a SQL temp table and doing SQL queries on these columns,
including SUM etc., works fine, so the Decimal schema does not seem to be the
issue.

When doing saveAsParquetFile on the RDD, it gives the following error. Not sure
why the "DecimalType" in the SchemaRDD is not seen by Parquet, which seems to
see it as scala.math.BigDecimal:

java.lang.ClassCastException: scala.math.BigDecimal cannot be cast to org.apache.spark.sql.catalyst.types.decimal.Decimal
    at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:359)
    at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:328)
    at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:314)
    at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
    at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
    at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
    at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:308)
    at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
    at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)


Logstash as a source?

2015-02-01 Thread NORD SC
Hi,

I plan to have logstash send log events (as key value pairs) to spark streaming 
using Spark on Cassandra.

Being completely fresh to Spark, I have a couple of questions:

- is that a good idea at all, or would it be better to put e.g. Kafka in 
between to handle traffic peaks
  (IOW: how and how well would Spark Streaming handle peaks?)

- Is there already a logstash-source implementation for Spark Streaming?

- assuming there is none yet and assuming it is a good idea: I’d dive into 
writing it myself - what would the core advice be to avoid beginner traps?

Jan
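
On the "writing it myself" question: the usual starting point is Spark Streaming's
custom receiver API. A minimal, hypothetical skeleton that reads newline-delimited
events from a TCP socket (which a logstash tcp output could feed) might look like
this - the host, port and line-based framing are assumptions:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;
    import org.apache.spark.storage.StorageLevel;
    import org.apache.spark.streaming.receiver.Receiver;

    // Hypothetical skeleton of a custom receiver for newline-delimited events.
    public class LogstashReceiver extends Receiver<String> {
        private final String host;
        private final int port;

        public LogstashReceiver(String host, int port) {
            super(StorageLevel.MEMORY_AND_DISK_2());
            this.host = host;
            this.port = port;
        }

        @Override
        public void onStart() {
            // Run the blocking read loop on its own thread so onStart() returns quickly.
            new Thread(new Runnable() {
                @Override
                public void run() {
                    receive();
                }
            }).start();
        }

        @Override
        public void onStop() {
            // The receive thread exits once isStopped() returns true.
        }

        private void receive() {
            try {
                Socket socket = new Socket(host, port);
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
                String line;
                while (!isStopped() && (line = reader.readLine()) != null) {
                    store(line); // hand each event to Spark Streaming
                }
                reader.close();
                socket.close();
                // Ask Spark to restart the receiver when the connection drops.
                restart("Connection closed, restarting");
            } catch (Exception e) {
                restart("Error receiving data", e);
            }
        }
    }

It would be registered with JavaStreamingContext.receiverStream(new LogstashReceiver(host, port)).
Putting something like Kafka in front is still the common way to absorb traffic peaks,
since a plain receiver has much less room to buffer bursts.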
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Error when running spark in debug mode

2015-02-01 Thread Ankur Srivastava
I am running on m3.xlarge instances on AWS with 12 gb worker memory and 10
gb executor memory.

On Sun, Feb 1, 2015, 12:41 PM Arush Kharbanda 
wrote:

> What is the machine configuration you are running it on?
>
> On Mon, Feb 2, 2015 at 1:46 AM, Ankur Srivastava <
> ankur.srivast...@gmail.com> wrote:
>
>> I am using the log4j.properties file which is shipped with SPARK
>> distribution in the conf directory.
>>
>> Thanks
>> Ankur
>>
>> On Sat, Jan 31, 2015, 12:36 AM Arush Kharbanda <
>> ar...@sigmoidanalytics.com> wrote:
>>
>>> Can you share your log4j file.
>>>
>>> On Sat, Jan 31, 2015 at 1:35 PM, Arush Kharbanda <
>>> ar...@sigmoidanalytics.com> wrote:
>>>
 Hi Ankur,

 Its running fine for me for spark 1.1 and changes to log4j properties
 file.

 Thanks
 Arush

 On Fri, Jan 30, 2015 at 9:49 PM, Ankur Srivastava <
 ankur.srivast...@gmail.com> wrote:

> Hi Arush
>
> I have configured log4j by updating the file log4j.properties in
> SPARK_HOME/conf folder.
>
> If it was a log4j defect we would get error in debug mode in all apps.
>
> Thanks
> Ankur
>  Hi Ankur,
>
> How are you enabling the debug level of logs. It should be a log4j
> configuration. Even if there would be some issue it would be in log4j and
> not in spark.
>
> Thanks
> Arush
>
> On Fri, Jan 30, 2015 at 4:24 AM, Ankur Srivastava <
> ankur.srivast...@gmail.com> wrote:
>
>> Hi,
>>
>> When ever I enable DEBUG level logs for my spark cluster, on running
>> a job all the executors die with the below exception. On disabling the
>> DEBUG logs my jobs move to the next step.
>>
>>
>> I am on spark-1.1.0
>>
>> Is this a known issue with spark?
>>
>> Thanks
>> Ankur
>>
>> 2015-01-29 22:27:42,467 [main] INFO  org.apache.spark.SecurityManager
>> - SecurityManager: authentication disabled; ui acls disabled; users with
>> view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu)
>>
>> 2015-01-29 22:27:42,478 [main] DEBUG org.apache.spark.util.AkkaUtils
>> - In createActorSystem, requireCookie is: off
>>
>> 2015-01-29 22:27:42,871
>> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO
>> akka.event.slf4j.Slf4jLogger - Slf4jLogger started
>>
>> 2015-01-29 22:27:42,912
>> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
>> Starting remoting
>>
>> 2015-01-29 22:27:43,057
>> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
>> Remoting started; listening on addresses :[akka.tcp://
>> driverPropsFetcher@10.77.9.155:36035]
>>
>> 2015-01-29 22:27:43,060
>> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
>> Remoting now listens on addresses: [akka.tcp://
>> driverPropsFetcher@10.77.9.155:36035]
>>
>> 2015-01-29 22:27:43,067 [main] INFO  org.apache.spark.util.Utils -
>> Successfully started service 'driverPropsFetcher' on port 36035.
>>
>> 2015-01-29 22:28:13,077 [main] ERROR
>> org.apache.hadoop.security.UserGroupInformation -
>> PriviledgedActionException as:ubuntu
>> cause:java.util.concurrent.TimeoutException: Futures timed out after [30
>> seconds]
>>
>> Exception in thread "main"
>> java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
>>
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
>>
>> at
>> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
>>
>> at
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
>>
>> at
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
>>
>> at
>> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
>>
>> Caused by: java.security.PrivilegedActionException:
>> java.util.concurrent.TimeoutException: Futures timed out after [30 
>> seconds]
>>
>> at java.security.AccessController.doPrivileged(Native Method)
>>
>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>>
>> ... 4 more
>>
>> Caused by: java.util.concurrent.TimeoutException: Futures timed out
>> after [30 seconds]
>>
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>>
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>>
>> at
>> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>>
>> at
>>
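
One way to keep DEBUG output for application code without turning on DEBUG for all of
Spark's and Akka's internals (which appears to be what overwhelms the executors here)
is to scope the change in conf/log4j.properties to specific loggers. A hypothetical
example, assuming the application classes live under com.mycompany:

    # Keep Spark and third-party libraries at INFO
    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

    # DEBUG only for the application's own packages (hypothetical package name)
    log4j.logger.com.mycompany=DEBUG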

Re: Error in saving schemaRDD with Decimal as Parquet

2015-02-01 Thread Manoj Samel
I think I found the issue causing it.

I was calling schemaRDD.coalesce(n).saveAsParquetFile to reduce the number
of partitions in the parquet file - in which case the stack trace happens.

If I coalesce the partitions before creating the SchemaRDD, then the
schemaRDD.saveAsParquetFile call works for decimal.

So it seems schemaRDD.coalesce returns an RDD whose schema does not match
the source RDD, in that the decimal type seems to get changed.

Any thoughts? Is this a bug?

Thanks,


On Sun, Feb 1, 2015 at 12:26 PM, Manoj Samel 
wrote:

> Spark 1.2
>
> SchemaRDD has schema with decimal columns created like
>
> x1 = new StructField("a", DecimalType(14,4), true)
>
> x2 = new StructField("b", DecimalType(14,4), true)
>
> Registering as SQL Temp table and doing SQL queries on these columns ,
> including SUM etc. works fine, so the schema Decimal does not seems to be
> issue
>
> When doing saveAsParquetFile on the RDD, it gives following error. Not
> sure why the "DecimalType" in SchemaRDD is not seen by Parquet, which seems
> to see it as scala.math.BigDecimal
>
> java.lang.ClassCastException: scala.math.BigDecimal cannot be cast to
> org.apache.spark.sql.catalyst.types.decimal.Decimal
>
> at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(
> ParquetTableSupport.scala:359)
>
> at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(
> ParquetTableSupport.scala:328)
>
> at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(
> ParquetTableSupport.scala:314)
>
> at parquet.hadoop.InternalParquetRecordWriter.write(
> InternalParquetRecordWriter.java:120)
>
> at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>
> at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>
> at org.apache.spark.sql.parquet.InsertIntoParquetTable.org
> $apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(
> ParquetTableOperations.scala:308)
>
> at
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(
> ParquetTableOperations.scala:325)
>
> at
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(
> ParquetTableOperations.scala:325)
>
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
>
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
>
>  at java.lang.Thread.run(Thread.java:744)
>
>
>
>


ClassNotFoundException when registering classes with Kryo

2015-02-01 Thread Arun Lists
Here is the relevant snippet of code in my main program:

===

sparkConf.set("spark.serializer",
  "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.kryo.registrationRequired", "true")
val summaryDataClass = classOf[SummaryData]
val summaryViewClass = classOf[SummaryView]
sparkConf.registerKryoClasses(Array(

  summaryDataClass, summaryViewClass))

===

I get the following error:

Exception in thread "main" java.lang.reflect.InvocationTargetException
...

Caused by: org.apache.spark.SparkException: Failed to load class to
register with Kryo
...

Caused by: java.lang.ClassNotFoundException:
com.dtex.analysis.transform.SummaryData


Note that the class in question SummaryData is in the same package as the
main program and hence in the same jar.

What do I need to do to make this work?

Thanks,
arun


Re: Logstash as a source?

2015-02-01 Thread Tsai Li Ming
I have been using a logstash alternative, fluentd, to ingest the data into HDFS.

I had to configure fluentd to not append the data so that spark streaming will 
be able to pick up the new logs.

-Liming


On 2 Feb, 2015, at 6:05 am, NORD SC  wrote:

> Hi,
> 
> I plan to have logstash send log events (as key value pairs) to spark 
> streaming using Spark on Cassandra.
> 
> Being completely fresh to Spark, I have a couple of questions:
> 
> - is that a good idea at all, or would it be better to put e.g. Kafka in 
> between to handle traffic peeks
>  (IOW: how and how well would Spark Streaming handle peeks?)
> 
> - Is there already a logstash-source implementation for Spark Streaming 
> 
> - assuming there is none yet and assuming it is a good idea: I’d dive into 
> writing it myself - what would the core advice be to avoid biginner traps?
> 
> Jan
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: running 2 spark applications in parallel on yarn

2015-02-01 Thread Sandy Ryza
Hi Tomer,

Are you able to look in your NodeManager logs to see if the NodeManagers
are killing any executors for exceeding memory limits?  If you observe
this, you can solve the problem by bumping up
spark.yarn.executor.memoryOverhead.

-Sandy

On Sun, Feb 1, 2015 at 5:28 AM, Tomer Benyamini  wrote:

> Hi all,
>
> I'm running spark 1.2.0 on a 20-node Yarn emr cluster. I've noticed that
> whenever I'm running a heavy computation job in parallel to other jobs
> running, I'm getting these kind of exceptions:
>
> * [task-result-getter-2] INFO  org.apache.spark.scheduler.TaskSetManager-
> Lost task 820.0 in stage 175.0 (TID 11327) on executor xxx:
> java.io.IOException (Failed to connect to xx:35194) [duplicate 12]
>
> * org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
> location for shuffle 12
>
> * org.apache.spark.shuffle.FetchFailedException: Failed to connect to
> x:35194
> Caused by: java.io.IOException: Failed to connect
> to x:35194
>
> when running the heavy job alone on the cluster, I'm not getting any
> errors. My guess is that spark contexts from different apps do not share
> information about taken ports, and therefore collide on specific ports,
> causing the job/stage to fail. Is there a way to assign a specific set of
> executors to a specific spark job via "spark-submit", or is there a way to
> define a range of ports to be used by the application?
>
> Thanks!
> Tomer
>
>
>
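
For reference, Sandy's suggestion translates to something like the following
spark-submit invocation (all numbers, the class name and the jar are placeholders;
spark.yarn.executor.memoryOverhead is given in megabytes):

    spark-submit \
      --master yarn-cluster \
      --num-executors 10 \
      --executor-memory 8g \
      --conf spark.yarn.executor.memoryOverhead=1024 \
      --class com.example.MyJob \
      my-job.jar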


Union in Spark

2015-02-01 Thread Deep Pradhan
Hi,
Is there any better operation than union? I am using union and the cluster
is getting stuck with a large data set.

Thank you


Re: Union in Spark

2015-02-01 Thread Jerry Lam
Hi Deep,

what do you mean by stuck?

Jerry

On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan 
wrote:

> Hi,
> Is there any better operation than Union. I am using union and the cluster
> is getting stuck with a large data set.
>
> Thank you
>


Re: Window comparison matching using the sliding window functionality: feasibility

2015-02-01 Thread ashu
Hi,
I want to know about your moving avg implementation. I am also doing some
time-series analysis about CPU performance. So I tried simple regression but
result is not good. rmse is 10 but when I extrapolate it just shoot up
linearly. I think I should first smoothed out the data then try regression
to forecast.
i am thinking of moving avg as an option,tried it out according to this
http://stackoverflow.com/questions/23402303/apache-spark-moving-average

but "partitionBy" is giving me error, I am building with Spark 1.2.0.
Can you share your ARIMA implementation if it is open source, else can you
give me hints about it

Will really appreciate the help.
Thanks 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Window-comparison-matching-using-the-sliding-window-functionality-feasibility-tp15352p21458.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
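
As a rough illustration of one partitioner-free way to compute a moving average on an
RDD (not the sliding/partitionBy approach from the Stack Overflow link above, and not
an ARIMA implementation): each point is emitted into every window it contributes to,
and the contributions are then averaged per window. Data and names are made up.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.api.java.function.PairFlatMapFunction;
    import scala.Tuple2;

    public class MovingAverageSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("moving-avg-sketch").setMaster("local[2]");
            JavaSparkContext sc = new JavaSparkContext(conf);
            final int k = 2; // half-window: average over 2k+1 points

            JavaPairRDD<Long, Double> series = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<Long, Double>(0L, 1.0), new Tuple2<Long, Double>(1L, 2.0),
                    new Tuple2<Long, Double>(2L, 3.0), new Tuple2<Long, Double>(3L, 4.0),
                    new Tuple2<Long, Double>(4L, 5.0)));

            // Each point (t, v) contributes to the windows centered at t-k .. t+k.
            JavaPairRDD<Long, Double> contributions = series.flatMapToPair(
                    new PairFlatMapFunction<Tuple2<Long, Double>, Long, Double>() {
                        @Override
                        public Iterable<Tuple2<Long, Double>> call(Tuple2<Long, Double> p) {
                            List<Tuple2<Long, Double>> out = new ArrayList<Tuple2<Long, Double>>();
                            for (long center = p._1() - k; center <= p._1() + k; center++) {
                                out.add(new Tuple2<Long, Double>(center, p._2()));
                            }
                            return out;
                        }
                    });

            // Average the contributions per window center (edge windows are shorter).
            JavaPairRDD<Long, Double> movingAvg = contributions.groupByKey().mapValues(
                    new Function<Iterable<Double>, Double>() {
                        @Override
                        public Double call(Iterable<Double> values) {
                            double sum = 0.0;
                            int n = 0;
                            for (double v : values) { sum += v; n++; }
                            return sum / n;
                        }
                    });

            System.out.println(movingAvg.sortByKey().collect());
            sc.close();
        }
    }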

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Union in Spark

2015-02-01 Thread Deep Pradhan
The cluster hangs.

On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam  wrote:

> Hi Deep,
>
> what do you mean by stuck?
>
> Jerry
>
>
> On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan 
> wrote:
>
>> Hi,
>> Is there any better operation than Union. I am using union and the
>> cluster is getting stuck with a large data set.
>>
>> Thank you
>>
>
>


Re: Union in Spark

2015-02-01 Thread Jerry Lam
Hi Deep,

How do you know the cluster is not responsive because of "Union"?
Did you check the spark web console?

Best Regards,

Jerry


On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan 
wrote:

> The cluster hangs.
>
> On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam  wrote:
>
>> Hi Deep,
>>
>> what do you mean by stuck?
>>
>> Jerry
>>
>>
>> On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan 
>> wrote:
>>
>>> Hi,
>>> Is there any better operation than Union. I am using union and the
>>> cluster is getting stuck with a large data set.
>>>
>>> Thank you
>>>
>>
>>
>


Connection closed/reset by peers error

2015-02-01 Thread Kartheek.R
Hi,

I keep facing this error when I run my application:

java.io.IOException: Connection from s1/- closed

java.io.IOException: Connection from s1/:43741 closed
    at org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:98)
    at org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:81)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:183)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:169)
    at io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:738)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$6.run(AbstractChannel.java:606)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at java.lang.Thread.run(Thread.java:745)


I have a 6 node cluster. I set executor memory = 5G as this is the
max. memory the weakest machine in the cluster can afford.

Any help please?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Connection-closed-reset-by-peers-error-tp21459.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Union in Spark

2015-02-01 Thread Deep Pradhan
I did not check the console because once the job starts I cannot run
anything else and have to force-shutdown the system. I commented out parts of
the code and tested, and I suspect it is because of union. So, I want to change it
to something else and see if the problem persists.

Thank you

On Mon, Feb 2, 2015 at 11:53 AM, Jerry Lam  wrote:

> Hi Deep,
>
> How do you know the cluster is not responsive because of "Union"?
> Did you check the spark web console?
>
> Best Regards,
>
> Jerry
>
>
> On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan 
> wrote:
>
>> The cluster hangs.
>>
>> On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam  wrote:
>>
>>> Hi Deep,
>>>
>>> what do you mean by stuck?
>>>
>>> Jerry
>>>
>>>
>>> On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan >> > wrote:
>>>
 Hi,
 Is there any better operation than Union. I am using union and the
 cluster is getting stuck with a large data set.

 Thank you

>>>
>>>
>>
>


Re: ClassNotFoundException when registering classes with Kryo

2015-02-01 Thread Shixiong Zhu
It's a bug that has been fixed in https://github.com/apache/spark/pull/4258
but not yet been merged.

Best Regards,
Shixiong Zhu

2015-02-02 10:08 GMT+08:00 Arun Lists :

> Here is the relevant snippet of code in my main program:
>
> ===
>
> sparkConf.set("spark.serializer",
>   "org.apache.spark.serializer.KryoSerializer")
> sparkConf.set("spark.kryo.registrationRequired", "true")
> val summaryDataClass = classOf[SummaryData]
> val summaryViewClass = classOf[SummaryView]
> sparkConf.registerKryoClasses(Array(
>
>   summaryDataClass, summaryViewClass))
>
> ===
>
> I get the following error:
>
> Exception in thread "main" java.lang.reflect.InvocationTargetException
> ...
>
> Caused by: org.apache.spark.SparkException: Failed to load class to
> register with Kryo
> ...
>
> Caused by: java.lang.ClassNotFoundException:
> com.dtex.analysis.transform.SummaryData
>
>
> Note that the class in question SummaryData is in the same package as the
> main program and hence in the same jar.
>
> What do I need to do to make this work?
>
> Thanks,
> arun
>
>
>


Re: ClassNotFoundException when registering classes with Kryo

2015-02-01 Thread Arun Lists
Thanks for the notification!

For now, I'll use the Kryo serializer without registering classes until the
bug fix has been merged into the next version of Spark (I guess that will
be 1.3, right?).

arun


On Sun, Feb 1, 2015 at 10:58 PM, Shixiong Zhu  wrote:

> It's a bug that has been fixed in
> https://github.com/apache/spark/pull/4258 but not yet been merged.
>
> Best Regards,
> Shixiong Zhu
>
> 2015-02-02 10:08 GMT+08:00 Arun Lists :
>
>> Here is the relevant snippet of code in my main program:
>>
>> ===
>>
>> sparkConf.set("spark.serializer",
>>   "org.apache.spark.serializer.KryoSerializer")
>> sparkConf.set("spark.kryo.registrationRequired", "true")
>> val summaryDataClass = classOf[SummaryData]
>> val summaryViewClass = classOf[SummaryView]
>> sparkConf.registerKryoClasses(Array(
>>
>>   summaryDataClass, summaryViewClass))
>>
>> ===
>>
>> I get the following error:
>>
>> Exception in thread "main" java.lang.reflect.InvocationTargetException
>> ...
>>
>> Caused by: org.apache.spark.SparkException: Failed to load class to
>> register with Kryo
>> ...
>>
>> Caused by: java.lang.ClassNotFoundException:
>> com.dtex.analysis.transform.SummaryData
>>
>>
>> Note that the class in question SummaryData is in the same package as the
>> main program and hence in the same jar.
>>
>> What do I need to do to make this work?
>>
>> Thanks,
>> arun
>>
>>
>>
>
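
For reference, the interim workaround mentioned above - keeping Kryo but skipping
registration - is just the configuration below, sketched here in Java (the original
snippet is Scala; app name and master are placeholders). Unregistered classes still
work because Kryo then writes the fully-qualified class name alongside each object,
at some cost in output size:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class KryoWithoutRegistration {
        public static void main(String[] args) {
            // Keep Kryo as the serializer but leave spark.kryo.registrationRequired
            // at its default (false), so no class registration is needed.
            SparkConf conf = new SparkConf()
                    .setAppName("kryo-without-registration")  // placeholder app name
                    .setMaster("local[2]")                     // placeholder master
                    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... job code ...
            sc.close();
        }
    }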


Re: Union in Spark

2015-02-01 Thread Arush Kharbanda
Hi Deep,

What is your configuration and what is the size of the 2 data sets?

Thanks
Arush

On Mon, Feb 2, 2015 at 11:56 AM, Deep Pradhan 
wrote:

> I did not check the console because once the job starts I cannot run
> anything else and have to force shutdown the system. I commented parts of
> codes and I tested. I doubt it is because of union. So, I want to change it
> to something else and see if the problem persists.
>
> Thank you
>
> On Mon, Feb 2, 2015 at 11:53 AM, Jerry Lam  wrote:
>
>> Hi Deep,
>>
>> How do you know the cluster is not responsive because of "Union"?
>> Did you check the spark web console?
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan 
>> wrote:
>>
>>> The cluster hangs.
>>>
>>> On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam  wrote:
>>>
 Hi Deep,

 what do you mean by stuck?

 Jerry


 On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan <
 pradhandeep1...@gmail.com> wrote:

> Hi,
> Is there any better operation than Union. I am using union and the
> cluster is getting stuck with a large data set.
>
> Thank you
>


>>>
>>
>


-- 


*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Union in Spark

2015-02-01 Thread Deep Pradhan
The configuration is 16 GB RAM and a 1 TB HD. I have a single-node Spark cluster.
Even after setting driver memory to 5g and executor memory to 3g, I get
this error. The size of the data set is 350 KB, and the data set with which it
works well is hardly a few KB.

On Mon, Feb 2, 2015 at 1:18 PM, Arush Kharbanda 
wrote:

> Hi Deep,
>
> What is your configuration and what is the size of the 2 data sets?
>
> Thanks
> Arush
>
> On Mon, Feb 2, 2015 at 11:56 AM, Deep Pradhan 
> wrote:
>
>> I did not check the console because once the job starts I cannot run
>> anything else and have to force shutdown the system. I commented parts of
>> codes and I tested. I doubt it is because of union. So, I want to change it
>> to something else and see if the problem persists.
>>
>> Thank you
>>
>> On Mon, Feb 2, 2015 at 11:53 AM, Jerry Lam  wrote:
>>
>>> Hi Deep,
>>>
>>> How do you know the cluster is not responsive because of "Union"?
>>> Did you check the spark web console?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>>
>>> On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan 
>>> wrote:
>>>
 The cluster hangs.

 On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam 
 wrote:

> Hi Deep,
>
> what do you mean by stuck?
>
> Jerry
>
>
> On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan <
> pradhandeep1...@gmail.com> wrote:
>
>> Hi,
>> Is there any better operation than Union. I am using union and the
>> cluster is getting stuck with a large data set.
>>
>> Thank you
>>
>
>

>>>
>>
>
>
> --
>
> [image: Sigmoid Analytics] 
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>