Re: AVRO vs Parquet

2016-03-10 Thread Guru Medasani
Thanks Michael for clarifying this. My responses are inline.

Guru Medasani
gdm...@gmail.com

> On Mar 10, 2016, at 12:38 PM, Michael Armbrust <mich...@databricks.com> wrote:
> 
> A few clarifications:
>  
> 1) High memory and cpu usage. This is because Parquet files can't be streamed 
> into as records arrive. I have seen a lot of OOMs in reasonably sized 
> MR/Spark containers that write out Parquet. When doing dynamic partitioning, 
> where many writers are open at once, we've seen customers have trouble making 
> it work. This has made for some very confused ETL developers.
> 
> In Spark 1.6.1 we avoid having more than 2 files open per task, so this 
> should be less of a problem even for dynamic partitioning.

Thanks for fixing this. Looks like this is the JIRA going into Spark 1.6.1 that 
fixes the memory issues during dynamic partitioning. I copied it here so the 
rest of the folks on the email thread can take a look. 

SPARK-12546 <https://issues.apache.org/jira/browse/SPARK-12546>

Writing to partitioned parquet table can fail with OOM
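For context, the failure usually shows up with dynamically partitioned writes like 
the sketch below (PySpark; the paths and column names are made up), where each 
distinct partition value keeps its own Parquet writer open inside the task:

    # In the pyspark shell, sqlContext is already available.
    events = sqlContext.read.json("/data/events")      # any DataFrame will do

    # Each distinct (year, month) pair becomes its own output directory, and
    # before the fix each one held an open Parquet writer in the same task.
    (events.write
           .partitionBy("year", "month")
           .mode("append")
           .parquet("/warehouse/events_parquet"))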


>  
> 2) Parquet lags well behind Avro in schema evolution semantics. You can only add 
> columns at the end. Deleting columns at the end is not recommended if you 
> plan to add any columns in the future. Reordering is not supported in the 
> current release. 
> 
> This may be true for Impala, but Spark SQL does schema merging by name so you 
> can add / reorder columns with the constraint that you cannot reuse a name 
> with an incompatible type.

As I mentioned in my previous email, it is something the user still needs to be 
aware of, since the user mentioned the following in the initial question:

I also have to consider any file on HDFS may be accessed from other tools like 
Hive, Impala, HAWQ.

Regarding reordering, I also mentioned in my previous email that it will be 
supported in Impala in the next release as well. 

In the next release, there might be support for optionally matching Parquet 
file columns by name instead of order, like Hive does. Under this scheme, you 
cannot rename columns (since the files will retain the old name and will no 
longer be matched), but you can reorder them. (This is regarding Impala.)
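On the Spark SQL side, the merge-by-name behavior Michael describes is exposed 
through the mergeSchema option on the Parquet reader; a minimal PySpark sketch 
(the path is made up):

    # Spark SQL matches Parquet columns by name, not position. Merging schemas
    # across files with different (compatible) schemas can be requested per read.
    df = (sqlContext.read
                    .option("mergeSchema", "true")
                    .parquet("/warehouse/events_parquet"))
    df.printSchema()   # columns added in newer files show up; older files return NULL for them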

Re: AVRO vs Parquet

2016-03-09 Thread Guru Medasani
+1 Paul. Both have some pros and cons.

Hope this helps.

Avro:

Pros: 
1) Plays nicely with other tools, third-party or otherwise, and covers the cases 
where you specifically need a data type Avro supports, like binary; gladly that 
list is shrinking all the time (yay nested types in Impala).
2) Good for event data that changes over time. Can be used in Kafka Schema 
Registry
3) Schema Evolution - supports adding, removing, renaming, reordering


Cons:
1) Cannot write data from Impala
2) Timestamp data type is not supported (Hive & Impala); we can still store 
timestamps as longs and convert them in the end-user tables.

Parquet:

Pros:
1) High performance during both reads and writes, for both wide and narrow tables 
(though mileage may vary based on the dataset's cardinality)
2) Great for analytics

Cons:
1) High memory and cpu usage. This is because Parquet files can't be streamed 
into as records arrive. I have seen a lot of OOMs in reasonably sized MR/Spark 
containers that write out Parquet. When doing dynamic partitioning, where many 
writers are open at once, we've seen customers have trouble making it work. 
This has made for some very confused ETL developers.
2) Parquet lags well behind Avro in schema evolution semantics. You can only add 
columns at the end. Deleting columns at the end is not recommended if you plan 
to add any columns in the future. Reordering is not supported in the current 
release. 

Here are the schema evolution semantics and best practices for Parquet and Avro in 
more detail. 

With Parquet, you can add columns to the end of the table definition and they'll be 
populated with NULL if missing in the file (you can technically delete columns 
from the end too, but then it will try to read the old deleted column if you 
add a new column to the end afterwards). 

Here is an example of how this works.

At t0 - we have a Parquet table with the columns c1, c2, c3, c4
 t1 - we delete the columns c3 and c4; the table now has the columns c1, c2
 t2 - we add a new column c5; the table now has the columns c1, c2, c5

So now when we query the table for columns c1, c2 and c5, it will try to read 
column c3 (rather than c5) from the old files.

Impala currently matches the file schema to the table schema by column order, 
so at t2, if you select c5, Impala will try to read c3 from the old files, 
thinking it's c5 since it's the third column in the file. This will work if the 
types of c3 and c5 are the same (but will probably not give the right results) or fail 
if the types don't match. It will ignore the old c4. Files with the new schema 
(c1,c2,c5) will be read correctly.

You can also rename columns, since we only look at the column ordering and not 
their names. So if you have columns c1 and c2, and reorder them to c2 and c1, 
Impala will read c1 when you query c2 and vice versa.

In the next release, there might be support for optionally matching Parquet 
file columns by name instead of order, like Hive does. Under this scheme, you 
cannot rename columns (since the files will retain the old name and will no 
longer be matched), but you can reorder them. (This is regarding Impala.)

In Avro we follow the rules described here: 
http://avro.apache.org/docs/1.8.0/spec.html#Schema+Resolution

This includes rearranging columns, renaming columns via aliasing, and certain 
data type differences (e.g. you can "promote" an int to a bigint in the table 
schema, but not vice versa). AFAIK you should always modify the Avro schema 
directly in the table metadata; I think weird stuff happens if you try to use 
the usual ALTER TABLE commands on their own. The preferred method is to modify 
the JSON schema file on HDFS and then use ALTER TABLE to add/remove/reorder/rename 
the columns.
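As an illustration of those rules, here is a sketch of what an evolved Avro schema 
might look like (record and field names are made up): one field is renamed via an 
alias and one is promoted from int to long, and the updated .avsc would then be 
pushed to HDFS and wired in with ALTER TABLE as described above:

    import json

    # Evolved schema: 'emp_id' used to be called 'employee_id' (handled via
    # aliases), and 'salary' was promoted from int to long (a legal promotion).
    evolved_schema = {
        "type": "record",
        "name": "employee",
        "fields": [
            {"name": "emp_id", "type": "int", "aliases": ["employee_id"]},
            {"name": "name",   "type": "string"},
            {"name": "salary", "type": "long"},
        ],
    }

    with open("employee.avsc", "w") as f:
        json.dump(evolved_schema, f, indent=2)
    # then: hdfs dfs -put -f employee.avsc <schema dir>, followed by the ALTER TABLE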

I think in some cases using ALTER TABLE without changing the schema could work 
by chance, but in general it seems dangerous to change the column metadata in 
the metastore and not have that be reflected in the Avro schema as well.

Guru Medasani
gdm...@gmail.com



> On Mar 4, 2016, at 7:36 AM, Paul Leclercq <paul.lecle...@tabmo.io> wrote:
> 
> 
> 
> Nice article about Parquet with Avro : 
> https://dzone.com/articles/understanding-how-parquet 
> <https://dzone.com/articles/understanding-how-parquet>
> http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ 
> <http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/>
> Nice video from the good folks of Cloudera for the differences between 
> "Avrow" and Parquet
> https://www.youtube.com/watch?v=AY1dEfyFeHc 
> <https://www.youtube.com/watch?v=AY1dEfyFeHc>
> 
> 2016-03-04 7:12 GMT+01:00 Koert Kuipers <ko...@tresata.com 
> <mailto:ko...@tresata.com>>:
> well can you use orc without bringing in the kitchen sink of dependencies 
> also known as hive?
> 
> On Thu, Mar 3, 2016 at 11:48 PM, Jong Wook Kim <ilike...@gmail.com 
> <mailto:ilike...@gmail.com>> wrote:

Re: Building a REST Service with Spark back-end

2016-03-02 Thread Guru Medasani
Hi Yanlin,

This is a fairly new effort and is not officially released/supported by 
Cloudera yet. I believe those numbers will be out once it is released.

Guru Medasani
gdm...@gmail.com



> On Mar 2, 2016, at 10:40 AM, yanlin wang <yanl...@me.com> wrote:
> 
> Did any one use Livy in real world high concurrency web app? I think it uses 
> spark submit command line to create job... How about  job server or notebook 
> comparing with Livy?
> 
> Thx,
> Yanlin
> 
> Sent from my iPhone
> 
> On Mar 2, 2016, at 6:24 AM, Guru Medasani <gdm...@gmail.com 
> <mailto:gdm...@gmail.com>> wrote:
> 
>> Hi Don,
>> 
>> Here is another REST interface for interacting with Spark from anywhere. 
>> 
>> https://github.com/cloudera/livy <https://github.com/cloudera/livy>
>> 
>> Here is an example to estimate PI using Spark from Python using requests 
>> library. 
>> 
>> >>> data = {
>> ...   'code': textwrap.dedent("""\
>> ...  val NUM_SAMPLES = 10;
>> ...  val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
>> ...val x = Math.random();
>> ...val y = Math.random();
>> ...if (x*x + y*y < 1) 1 else 0
>> ...  }.reduce(_ + _);
>> ...  println(\"Pi is roughly \" + 4.0 * count / NUM_SAMPLES)
>> ...  """)
>> ... }
>> >>> r = requests.post(statements_url, data=json.dumps(data), headers=headers)
>> >>> pprint.pprint(r.json())
>> {u'id': 1,
>>  u'output': {u'data': {u'text/plain': u'Pi is roughly 3.14004\nNUM_SAMPLES: 
>> Int = 10\ncount: Int = 78501'},
>>  u'execution_count': 1,
>>  u'status': u'ok'},
>>  u'state': u'available'}
>> 
>> 
>> Guru Medasani
>> gdm...@gmail.com <mailto:gdm...@gmail.com>
>> 
>> 
>> 
>>> On Mar 2, 2016, at 7:47 AM, Todd Nist <tsind...@gmail.com 
>>> <mailto:tsind...@gmail.com>> wrote:
>>> 
>>> Have you looked at Apache Toree, http://toree.apache.org/ 
>>> <http://toree.apache.org/>.  This was formerly the Spark-Kernel from IBM 
>>> but contributed to apache.
>>> 
>>> https://github.com/apache/incubator-toree 
>>> <https://github.com/apache/incubator-toree>
>>> 
>>> You can find a good overview on the spark-kernel here:
>>> http://www.spark.tc/how-to-enable-interactive-applications-against-apache-spark/
>>>  
>>> <http://www.spark.tc/how-to-enable-interactive-applications-against-apache-spark/>
>>> 
>>> Not sure if that is of value to you or not.
>>> 
>>> HTH.
>>> 
>>> -Todd
>>> 
>>> On Tue, Mar 1, 2016 at 7:30 PM, Don Drake <dondr...@gmail.com 
>>> <mailto:dondr...@gmail.com>> wrote:
>>> I'm interested in building a REST service that utilizes a Spark SQL Context 
>>> to return records from a DataFrame (or IndexedRDD?) and even add/update 
>>> records.
>>> 
>>> This will be a simple REST API, with only a few end-points.  I found this 
>>> example:
>>> 
>>> https://github.com/alexmasselot/spark-play-activator 
>>> <https://github.com/alexmasselot/spark-play-activator>
>>> 
>>> which looks close to what I am interested in doing.  
>>> 
>>> Are there any other ideas or options if I want to run this in a YARN 
>>> cluster?
>>> 
>>> Thanks.
>>> 
>>> -Don
>>> 
>>> -- 
>>> Donald Drake
>>> Drake Consulting
>>> http://www.drakeconsulting.com/ <http://www.drakeconsulting.com/>
>>> https://twitter.com/dondrake <http://www.maillaunder.com/>
>>> 800-733-2143 
>> 



Re: Building a REST Service with Spark back-end

2016-03-02 Thread Guru Medasani
Hi Don,

Here is another REST interface for interacting with Spark from anywhere. 

https://github.com/cloudera/livy

Here is an example that estimates Pi using Spark, submitted from Python with the 
requests library. 

>>> data = {
...   'code': textwrap.dedent("""\
...  val NUM_SAMPLES = 10;
...  val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
...val x = Math.random();
...val y = Math.random();
...if (x*x + y*y < 1) 1 else 0
...  }.reduce(_ + _);
...  println(\"Pi is roughly \" + 4.0 * count / NUM_SAMPLES)
...  """)
... }
>>> r = requests.post(statements_url, data=json.dumps(data), headers=headers)
>>> pprint.pprint(r.json())
{u'id': 1,
 u'output': {u'data': {u'text/plain': u'Pi is roughly 3.14004\nNUM_SAMPLES: Int 
= 10\ncount: Int = 78501'},
 u'execution_count': 1,
 u'status': u'ok'},
 u'state': u'available'}
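For completeness, the statements_url and headers used above come from first 
creating an interactive session against the Livy server; a rough sketch (the host 
and port are made up, and the exact payload can vary between Livy versions):

    import json, requests

    livy = "http://livy-server:8998"             # made-up host
    headers = {"Content-Type": "application/json"}

    # Create an interactive Spark (Scala) session and derive the statements URL.
    r = requests.post(livy + "/sessions", data=json.dumps({"kind": "spark"}),
                      headers=headers)
    session_url = livy + r.headers["Location"]   # e.g. /sessions/0
    statements_url = session_url + "/statements"

Once the session reaches the idle state, code snippets can be POSTed to 
statements_url as shown above.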


Guru Medasani
gdm...@gmail.com



> On Mar 2, 2016, at 7:47 AM, Todd Nist <tsind...@gmail.com> wrote:
> 
> Have you looked at Apache Toree, http://toree.apache.org/ 
> <http://toree.apache.org/>.  This was formerly the Spark-Kernel from IBM but 
> contributed to apache.
> 
> https://github.com/apache/incubator-toree 
> <https://github.com/apache/incubator-toree>
> 
> You can find a good overview on the spark-kernel here:
> http://www.spark.tc/how-to-enable-interactive-applications-against-apache-spark/
>  
> <http://www.spark.tc/how-to-enable-interactive-applications-against-apache-spark/>
> 
> Not sure if that is of value to you or not.
> 
> HTH.
> 
> -Todd
> 
> On Tue, Mar 1, 2016 at 7:30 PM, Don Drake <dondr...@gmail.com 
> <mailto:dondr...@gmail.com>> wrote:
> I'm interested in building a REST service that utilizes a Spark SQL Context 
> to return records from a DataFrame (or IndexedRDD?) and even add/update 
> records.
> 
> This will be a simple REST API, with only a few end-points.  I found this 
> example:
> 
> https://github.com/alexmasselot/spark-play-activator 
> <https://github.com/alexmasselot/spark-play-activator>
> 
> which looks close to what I am interested in doing.  
> 
> Are there any other ideas or options if I want to run this in a YARN cluster?
> 
> Thanks.
> 
> -Don
> 
> -- 
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/ <http://www.drakeconsulting.com/>
> https://twitter.com/dondrake <http://www.maillaunder.com/>
> 800-733-2143 



Re: Error in load hbase on spark

2015-10-09 Thread Guru Medasani
Hi Roy,

Here is a Cloudera Labs project, SparkOnHBase, that makes it really simple to 
read HBase data into Spark.

https://github.com/cloudera-labs/SparkOnHBase

Here is a link to a blog post that explains how to use the package.

http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

It has also been committed to the HBase project now.

http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/

HBase JIRA link: https://issues.apache.org/jira/browse/HBASE-13992
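For anyone doing this from PySpark instead, the TableInputFormat-based read from 
the original question (quoted below) looks roughly like this sketch; the converter 
classes ship with the Spark examples jar, and the table name and ZooKeeper quorum 
are made up. SparkOnHBase removes most of this boilerplate:

    # Requires the spark-examples jar and the HBase client jars on the classpath.
    conf = {
        "hbase.zookeeper.quorum": "zk-host",        # made-up quorum
        "hbase.mapreduce.inputtable": "my_table",   # i.e. TableInputFormat.INPUT_TABLE
    }
    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
        conf=conf)
    print(hbase_rdd.count())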


Guru Medasani
gdm...@gmail.com



> On Oct 8, 2015, at 9:29 PM, Roy Wang <roywang1...@163.com> wrote:
> 
> 
> I want to load hbase table into spark.
> JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
> sc.newAPIHadoopRDD(conf, TableInputFormat.class,
> ImmutableBytesWritable.class, Result.class);
> 
> *when call hBaseRDD.count(),got error.*
> 
> Caused by: java.lang.IllegalStateException: The input format instance has
> not been properly initialized. Ensure you call initializeTable either in
> your constructor or initialize method
>   at
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:389)
>   at
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:158)
>   ... 11 more
> 
> *But when job start,I can get these logs*
> 2015-10-09 09:17:00[main] WARN  TableInputFormatBase:447 - initializeTable
> called multiple times. Overwriting connection and table reference;
> TableInputFormatBase will not close these old references when done.
> 
> Does anyone know how does this happen?
> 
> Thanks! 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Error-in-load-hbase-on-spark-tp24986.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 



Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Guru Medasani
Hi Lan,

Reading the pull request below, it looks like you should be able to apply the 
config to both drivers and executors. I would give it a try with spark-shell in 
YARN client mode.

https://github.com/apache/spark/pull/3233

Yarn's config option spark.yarn.user.classpath.first does not work the same way 
as
spark.files.userClassPathFirst; Yarn's version is a lot more dangerous, in that 
it
modifies the system classpath, instead of restricting the changes to the user's 
class
loader. So this change implements the behavior of the latter for Yarn, and 
deprecates
the more dangerous choice.

To be able to achieve feature-parity, I also implemented the option for drivers 
(the existing
option only applies to executors). So now there are two options, each 
controlling whether
to apply userClassPathFirst to the driver or executors. The old option was 
deprecated, and
aliased to the new one (spark.executor.userClassPathFirst).

The existing "child-first" class loader also had to be fixed. It didn't handle 
resources, and it
was also doing some things that ended up causing JVM errors depending on how 
things
were being called.
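Concretely, trying it out with spark-shell in YARN client mode would look something 
like this (a sketch; the uber-jar path is a placeholder):

    spark-shell --master yarn-client \
      --jars /path/to/my-uber.jar \
      --conf spark.driver.userClassPathFirst=true \
      --conf spark.executor.userClassPathFirst=true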


Guru Medasani
gdm...@gmail.com



> On Sep 15, 2015, at 9:33 AM, Lan Jiang <ljia...@gmail.com> wrote:
> 
> Steve,
> 
> Thanks for the input. You are absolutely right. When I use protobuf 2.6.1, I 
> also ran into method not defined errors. You suggest using Maven sharding 
> strategy, but I have already built the uber jar to package all my custom 
> classes and its dependencies including protobuf 3. The problem is how to 
> configure spark shell to use my uber jar first. 
> 
> java8964 -- appreciate the link and I will try the configuration. Looks 
> promising. However, the "user classpath first" attribute does not apply to 
> spark-shell, am I correct? 
> 
> Lan
> 
> On Tue, Sep 15, 2015 at 8:24 AM, java8964 <java8...@hotmail.com 
> <mailto:java8...@hotmail.com>> wrote:
> It is a bad idea to use the major version change of protobuf, as it most 
> likely won't work.
> 
> But you really want to give it a try, set the "user classpath first", so the 
> protobuf 3 coming with your jar will be used.
> 
> The setting depends on your deployment mode, check this for the parameter:
> 
> https://issues.apache.org/jira/browse/SPARK-2996 
> <https://issues.apache.org/jira/browse/SPARK-2996>
> 
> Yong
> 
> Subject: Re: Change protobuf version or any other third party library version 
> in Spark application
> From: ste...@hortonworks.com <mailto:ste...@hortonworks.com>
> To: ljia...@gmail.com <mailto:ljia...@gmail.com>
> CC: user@spark.apache.org <mailto:user@spark.apache.org>
> Date: Tue, 15 Sep 2015 09:19:28 +
> 
> 
> 
> 
> On 15 Sep 2015, at 05:47, Lan Jiang <ljia...@gmail.com 
> <mailto:ljia...@gmail.com>> wrote:
> 
> Hi, there,
> 
> I am using Spark 1.4.1. The protobuf 2.5 is included by Spark 1.4.1 by 
> default. However, I would like to use Protobuf 3 in my spark application so 
> that I can use some new features such as Map support.  Is there anyway to do 
> that? 
> 
> Right now if I build a uber.jar with dependencies including protobuf 3 
> classes and pass to spark-shell through --jars option, during the execution, 
> I got the error java.lang.NoSuchFieldError: unknownFields. 
> 
> 
> protobuf is an absolute nightmare version-wise, as protoc generates 
> incompatible java classes even across point versions. Hadoop 2.2+ is and will 
> always be protobuf 2.5 only; that applies transitively to downstream projects 
>  (the great protobuf upgrade of 2013 was actually pushed by the HBase team, 
> and required a co-ordinated change across multiple projects)
> 
> 
> Is there anyway to use a different version of Protobuf other than the default 
> one included in the Spark distribution? I guess I can generalize and extend 
> the question to any third party libraries. How to deal with version conflict 
> for any third party libraries included in the Spark distribution? 
> 
> maven shading is the strategy. Generally it is less needed, though the 
> troublesome binaries are,  across the entire apache big data stack:
> 
> google protobuf
> google guava
> kryo
> jackson
> 
> you can generally bump up the other versions, at least by point releases.
> 



Re: Problem while loading saved data

2015-09-02 Thread Guru Medasani
Hi Amila,

The error says that the 'people.parquet2' path does not exist (note that the code 
shown writes to 'people.parquet' but the traceback reads 'people.parquet2'). Can 
you manually check whether that file exists?

> Py4JJavaError: An error occurred while calling o53840.parquet.
> : java.lang.AssertionError: assertion failed: No schema defined, and no 
> Parquet data file or summary file found under 
> file:/home/ubuntu/ipython/people.parquet2.
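(A quick way to check from the same notebook; a sketch that looks at both the HDFS 
home directory and the local path from the error message:)

    import subprocess
    # Check the HDFS home directory for the relative path used in the write,
    # and the local filesystem path that appears in the traceback.
    subprocess.call(["hdfs", "dfs", "-ls", "people.parquet"])
    subprocess.call(["ls", "-ld", "/home/ubuntu/ipython/people.parquet2"])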


Guru Medasani
gdm...@gmail.com



> On Sep 2, 2015, at 8:25 PM, Amila De Silva <jaa...@gmail.com> wrote:
> 
> Hi All,
> 
> I have a two node spark cluster, to which I'm connecting using IPython 
> notebook.
> To see how data saving/loading works, I simply created a dataframe using 
> people.json using the Code below;
> 
> df = sqlContext.read.json("examples/src/main/resources/people.json")
> 
> Then called the following to save the dataframe as a parquet.
> df.write.save("people.parquet")
> 
> Tried loading the saved dataframe using;
> df2 = sqlContext.read.parquet('people.parquet');
> 
> But this simply fails giving the following exception
> 
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 df2 = sqlContext.read.parquet('people.parquet2');
> 
> /srv/spark/python/pyspark/sql/readwriter.pyc in parquet(self, *path)
> 154 [('name', 'string'), ('year', 'int'), ('month', 'int'), 
> ('day', 'int')]
> 155 """
> --> 156 return 
> self._df(self._jreader.parquet(_to_seq(self._sqlContext._sc, path)))
> 157 
> 158 @since(1.4)
> 
> /srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name <http://self.name/>)
> 539 
> 540 for temp_arg in temp_args:
> 
> /srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 298 raise Py4JJavaError(
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> 302 raise Py4JError(
> 
> Py4JJavaError: An error occurred while calling o53840.parquet.
> : java.lang.AssertionError: assertion failed: No schema defined, and no 
> Parquet data file or summary file found under 
> file:/home/ubuntu/ipython/people.parquet2.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.org$apache$spark$sql$parquet$ParquetRelation2$MetadataCache$$readSchema(newParquet.scala:429)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$11.apply(newParquet.scala:369)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$11.apply(newParquet.scala:369)
>   at scala.Option.orElse(Option.scala:257)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)
>   at org.apache.spark.sql.parquet.ParquetRelation2.org 
> <http://org.apache.spark.sql.parquet.parquetrelation2.org/>$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:126)
>   at org.apache.spark.sql.parquet.ParquetRelation2.org 
> <http://org.apache.spark.sql.parquet.parquetrelation2.org/>$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:124)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:165)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:165)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:165)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:506)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:505)
>   at 
> org.apache.spark.sql.sources.LogicalRelation.(LogicalRelation.scala:30)
>   at 
> org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:438)
>   at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:264)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorI

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Guru Medasani
For Python, it is really great. 

There is some work in progress on bringing Scala support to Jupyter as well.

https://github.com/hohonuuli/sparknotebook

https://github.com/alexarchambault/jupyter-scala


Guru Medasani
gdm...@gmail.com



 On Aug 18, 2015, at 12:29 PM, Jerry Lam chiling...@gmail.com wrote:
 
 Hi Guru,
 
 Thanks! Great to hear that someone tried it in production. How do you like it 
 so far?
 
 Best Regards,
 
 Jerry
 
 
 On Tue, Aug 18, 2015 at 11:38 AM, Guru Medasani gdm...@gmail.com 
 mailto:gdm...@gmail.com wrote:
 Hi Jerry,
 
 Yes. I’ve seen customers using this in production for data science work. I’m 
 currently using this for one of my projects on a cluster as well. 
 
 Also, here is a blog that describes how to configure this. 
 
 http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/
  
 http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/
 
 
 Guru Medasani
 gdm...@gmail.com mailto:gdm...@gmail.com
 
 
 
 On Aug 18, 2015, at 8:35 AM, Jerry Lam chiling...@gmail.com 
 mailto:chiling...@gmail.com wrote:
 
 Hi spark users and developers,
 
 Did anyone have IPython Notebook (Jupyter) deployed in production that uses 
 Spark as the computational engine? 
 
 I know Databricks Cloud provides similar features with deeper integration 
 with Spark. However, Databricks Cloud has to be hosted by Databricks so we 
 cannot do this. 
 
 Other solutions (e.g. Zeppelin) seem to reinvent the wheel that IPython has 
 already offered years ago. It would be great if someone can educate me the 
 reason behind this.
 
 Best Regards,
 
 Jerry
 
 



Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Guru Medasani
Hi Jerry,

Yes. I’ve seen customers using this in production for data science work. I’m 
currently using this for one of my projects on a cluster as well. 

Also, here is a blog that describes how to configure this. 

http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/
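The short version of that setup is pointing the PySpark driver at IPython/Jupyter 
when launching, roughly like this (a sketch; the port and master are placeholders, 
and the blog covers the full configuration):

    PYSPARK_DRIVER_PYTHON=ipython \
    PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8880" \
    pyspark --master yarn-client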


Guru Medasani
gdm...@gmail.com



 On Aug 18, 2015, at 8:35 AM, Jerry Lam chiling...@gmail.com wrote:
 
 Hi spark users and developers,
 
 Did anyone have IPython Notebook (Jupyter) deployed in production that uses 
 Spark as the computational engine? 
 
 I know Databricks Cloud provides similar features with deeper integration 
 with Spark. However, Databricks Cloud has to be hosted by Databricks so we 
 cannot do this. 
 
 Other solutions (e.g. Zeppelin) seem to reinvent the wheel that IPython has 
 already offered years ago. It would be great if someone can educate me the 
 reason behind this.
 
 Best Regards,
 
 Jerry



Re: Topology.py -- Cannot run on Spark Gateway on Cloudera 5.4.4.

2015-08-03 Thread Guru Medasani
Hi Upen,

Did you deploy the client configs after assigning the gateway roles? You should 
be able to do this from Cloudera Manager. 

Can you try this and let us know what you see when you run spark-shell?

Guru Medasani
gdm...@gmail.com



 On Aug 3, 2015, at 9:10 PM, Upen N ukn...@gmail.com wrote:
 
 Hi,
 I recently installed Cloudera CDH 5.4.4. Sparks comes shipped with this 
 version. I created Spark gateways. But I get the following error when run 
 Spark shell from the gateway. Does anyone have any similar experience ? If 
 so, please share the solution. Google shows to copy the Conf files from data 
 nodes to gateway nodes. But I highly doubt if that is the right fix. 
 
 Thanks
 Upender
 etc/hadoop/conf.cloudera.yarn/topology.py 
 java.io.IOException: Cannot run program
 /etc/hadoop/conf.cloudera.yarn/topology.py



Re: Spark-Submit error

2015-08-03 Thread Guru Medasani
Hi Satish,

Can you add more error or log info to the email?


Guru Medasani
gdm...@gmail.com



 On Jul 31, 2015, at 1:06 AM, satish chandra j jsatishchan...@gmail.com 
 wrote:
 
 HI,
 I have submitted a Spark Job with options jars,class,master as local but i am 
 getting an error as below
 
 dse spark-submit spark error exception in thread main java.io.ioexception: 
 Invalid Request Exception(Why you have not logged in)
 
 Note: submitting datastax spark node
 
 please let me know if anybody have a solutions for this issue
 
 
 
 Regards,
 Saish Chandra



Re: Spark-Submit error

2015-08-03 Thread Guru Medasani
Thanks Satish. I only see the INFO messages and don’t see any error messages in 
the output you pasted. 

Can you paste the log with the error messages?

Guru Medasani
gdm...@gmail.com



 On Aug 3, 2015, at 11:12 PM, satish chandra j jsatishchan...@gmail.com 
 wrote:
 
 Hi Guru,
 I am executing this on DataStax Enterprise Spark node and ~/.dserc file 
 exists which consists Cassandra credentials but still getting the error
 
 Below is the given command 
 
 dse spark-submit --master spark://10.246.43.15:7077 
 http://10.246.43.15:7077/ --class HelloWorld --jars 
 ///home/missingmerch/postgresql-9.4-1201.jdbc41.jar 
 ///home/missingmerch/etl-0.0.1-SNAPSHOT.jar
 
 
 Please find the error log details below
 
 INFO  2015-07-31 05:15:17 
 org.apache.spark.executor.CoarseGrainedExecutorBackend: Registered signal 
 handlers for [TERM, HUP, INT] INFO  2015-07-31 05:15:18 
 org.apache.spark.SecurityManager: Changing view acls to: 
 cassandra,missingmerch INFO  2015-07-31 05:15:18 
 org.apache.spark.SecurityManager: Changing modify acls to: 
 cassandra,missingmerch INFO  2015-07-31 05:15:18 
 org.apache.spark.SecurityManager: SecurityManager: authentication disabled; 
 ui acls disabled; users with view permissions: Set(cassandra, missingmerch); 
 users with modify permissions: Set(cassandra, missingmerch) INFO  2015-07-31 
 05:15:22 akka.event.slf4j.Slf4jLogger: Slf4jLogger started INFO  2015-07-31 
 05:15:22 Remoting: Starting remoting INFO  2015-07-31 05:15:22 Remoting: 
 Remoting started; listening on addresses 
 :[akka.tcp://driverPropsFetcher@10.246.43.14 
 mailto:driverPropsFetcher@10.246.43.14mailto://driverPropsFetcher@10.246.43.14
  mailto:driverPropsFetcher@10.246.43.14:48952]
 INFO  2015-07-31 05:15:22 org.apache.spark.util.Utils: Successfully started 
 service 'driverPropsFetcher' on port 48952.
 INFO  2015-07-31 05:15:24 
 akka.remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote 
 daemon.
 INFO  2015-07-31 05:15:24 
 akka.remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut 
 down; proceeding with flushing remote transports.
 INFO  2015-07-31 05:15:24 org.apache.spark.SecurityManager: Changing view 
 acls to: cassandra,missingmerch INFO  2015-07-31 05:15:24 
 org.apache.spark.SecurityManager: Changing modify acls to: 
 cassandra,missingmerch INFO  2015-07-31 05:15:24 
 org.apache.spark.SecurityManager: SecurityManager: authentication disabled; 
 ui acls disabled; users with view permissions: Set(cassandra, missingmerch); 
 users with modify permissions: Set(cassandra, missingmerch) INFO  2015-07-31 
 05:15:24 akka.event.slf4j.Slf4jLogger: Slf4jLogger started INFO  2015-07-31 
 05:15:24 akka.remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut 
 down.
 INFO  2015-07-31 05:15:24 Remoting: Starting remoting INFO  2015-07-31 
 05:15:24 Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkExecutor@10.246.43.14 
 mailto:sparkExecutor@10.246.43.14mailto://sparkExecutor@10.246.43.14 
 mailto:sparkExecutor@10.246.43.14:56358]
 INFO  2015-07-31 05:15:24 org.apache.spark.util.Utils: Successfully started 
 service 'sparkExecutor' on port 56358.
 INFO  2015-07-31 05:15:24 
 org.apache.spark.executor.CoarseGrainedExecutorBackend: Connecting to driver: 
 akka.tcp://sparkdri...@tstl400029.wal-mart.com 
 mailto:sparkdri...@tstl400029.wal-mart.commailto://sparkdri...@tstl400029.wal-mart.com
  
 mailto:sparkdri...@tstl400029.wal-mart.com:60525/user/CoarseGrainedScheduler
 INFO  2015-07-31 05:15:24 org.apache.spark.deploy.worker.WorkerWatcher: 
 Connecting to worker akka.tcp://sparkWorker@10.246.43.14 
 mailto:sparkWorker@10.246.43.14mailto://sparkWorker@10.246.43.14 
 mailto:sparkWorker@10.246.43.14:51552/user/Worker
 INFO  2015-07-31 05:15:24 org.apache.spark.deploy.worker.WorkerWatcher: 
 Successfully connected to akka.tcp://sparkWorker@10.246.43.14 
 mailto:sparkWorker@10.246.43.14mailto://sparkWorker@10.246.43.14 
 mailto:sparkWorker@10.246.43.14:51552/user/Worker
 INFO  2015-07-31 05:15:24 
 org.apache.spark.executor.CoarseGrainedExecutorBackend: Successfully 
 registered with driver INFO  2015-07-31 05:15:24 
 org.apache.spark.executor.Executor: Starting executor ID 0 on host 
 10.246.43.14 INFO  2015-07-31 05:15:24 org.apache.spark.SecurityManager: 
 Changing view acls to: cassandra,missingmerch INFO  2015-07-31 05:15:24 
 org.apache.spark.SecurityManager: Changing modify acls to: 
 cassandra,missingmerch INFO  2015-07-31 05:15:24 
 org.apache.spark.SecurityManager: SecurityManager: authentication disabled; 
 ui acls disabled; users with view permissions: Set(cassandra, missingmerch); 
 users with modify permissions: Set(cassandra, missingmerch) INFO  2015-07-31 
 05:15:24 org.apache.spark.util.AkkaUtils: Connecting to MapOutputTracker: 
 akka.tcp://sparkdri...@tstl400029.wal-mart.com 
 mailto:sparkdri...@tstl400029.wal-mart.commailto://sparkdri...@tstl400029.wal-mart.com
  mailto:sparkdri...@tstl400029.wal-mart.com:60525/user/MapOutputTracker
 INFO

Re: How to verify that the worker is connected to master in CDH5.4

2015-07-07 Thread Guru Medasani
Hi Ashish,

Are you running Spark-on-YARN on the cluster with an instance of Spark History 
server? 

Also, if you are using Cloudera Manager with Spark on YARN, the Spark on YARN 
service has a link to the history server web UI. 

Can you paste the command and the output you are seeing in the thread?

Guru Medasani
gdm...@gmail.com



 On Jul 7, 2015, at 10:42 PM, Ashish Dutt ashish.du...@gmail.com wrote:
 
 Hi,
 I have CDH 5.4 installed on a linux server. It has 1 cluster in which spark 
 is deployed as a history server.
 I am trying to connect my laptop to the spark history server.
 When I run spark-shell master ip: port number I get the following output
 How can I verify that the worker is connected to the master?
 
 Thanks,
 Ashish
  


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to verify that the worker is connected to master in CDH5.4

2015-07-07 Thread Guru Medasani
Hi Ashish,

If you are not using Spark on YARN and are instead using Spark Standalone, you 
don't need the Spark history server. More on the web interfaces is provided in the 
link below. Since you are using standalone mode, you should be able to access 
the web UI for the master and workers at the ports that Ayan provided in an 
earlier email.

Master: http://masterip:8080 
Worker: http://workerIp:8081

https://spark.apache.org/docs/latest/monitoring.html

If you are using Spark on YARN, the Spark history server listens on port 18080 by 
default on the host where it is running.

Guru Medasani
gdm...@gmail.com



 On Jul 8, 2015, at 12:01 AM, Ashish Dutt ashish.du...@gmail.com wrote:
 
 Hello Guru,
 Thank you for your quick response. 
  This is what i get when I try executing spark-shell master ip:port number
 
 C:\spark-1.4.0\binspark-shell master IP:18088
 log4j:WARN No appenders could be found for logger 
 (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
 log4j:WARN Please initialize the log4j system properly.
 log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig 
 http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
 Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 15/07/08 11:28:35 INFO SecurityManager: Changing view acls to: Ashish Dutt
 15/07/08 11:28:35 INFO SecurityManager: Changing modify acls to: Ashish Dutt
 15/07/08 11:28:35 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set
 (Ashish Dutt); users with modify permissions: Set(Ashish Dutt)
 15/07/08 11:28:35 INFO HttpServer: Starting HTTP Server
 15/07/08 11:28:35 INFO Utils: Successfully started service 'HTTP class 
 server' on port 52767.
 Welcome to
     __
  / __/__  ___ _/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 1.4.0
   /_/
 
 Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79)
 Type in expressions to have them evaluated.
 Type :help for more information.
 15/07/08 11:28:39 INFO SparkContext: Running Spark version 1.4.0
 15/07/08 11:28:39 INFO SecurityManager: Changing view acls to: Ashish Dutt
 15/07/08 11:28:39 INFO SecurityManager: Changing modify acls to: Ashish Dutt
 15/07/08 11:28:39 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set
 (Ashish Dutt); users with modify permissions: Set(Ashish Dutt)
 15/07/08 11:28:40 INFO Slf4jLogger: Slf4jLogger started
 15/07/08 11:28:40 INFO Remoting: Starting remoting
 15/07/08 11:28:40 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkDriver@10.228.208.74:52780 
 http://sparkDriver@10.228.208.74:52780/]
 15/07/08 11:28:40 INFO Utils: Successfully started service 'sparkDriver' on 
 port 52780.
 15/07/08 11:28:40 INFO SparkEnv: Registering MapOutputTracker
 15/07/08 11:28:40 INFO SparkEnv: Registering BlockManagerMaster
 15/07/08 11:28:40 INFO DiskBlockManager: Created local directory at 
 C:\Users\Ashish Dutt\AppData\Local\Temp\spark-80c4f1fe-37de-4aef
 -9063-cae29c488382\blockmgr-a967422b-05e8-4fc1-b60b-facc7dbd4414
 15/07/08 11:28:40 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
 15/07/08 11:28:40 INFO HttpFileServer: HTTP File server directory is 
 C:\Users\Ashish Dutt\AppData\Local\Temp\spark-80c4f1fe-37de-4ae
 f-9063-cae29c488382\httpd-928f4485-ea08-4749-a478-59708db0fefa
 15/07/08 11:28:40 INFO HttpServer: Starting HTTP Server
 15/07/08 11:28:40 INFO Utils: Successfully started service 'HTTP file server' 
 on port 52781.
 15/07/08 11:28:40 INFO SparkEnv: Registering OutputCommitCoordinator
 15/07/08 11:28:40 INFO Utils: Successfully started service 'SparkUI' on port 
 4040.
 15/07/08 11:28:40 INFO SparkUI: Started SparkUI at http://10.228.208.74:4040 
 http://10.228.208.74:4040/
 15/07/08 11:28:40 INFO Executor: Starting executor ID driver on host localhost
 15/07/08 11:28:41 INFO Executor: Using REPL class URI: 
 http://10.228.208.74:52767 http://10.228.208.74:52767/
 15/07/08 11:28:41 INFO Utils: Successfully started service 
 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52800.
 
 15/07/08 11:28:41 INFO NettyBlockTransferService: Server created on 52800
 15/07/08 11:28:41 INFO BlockManagerMaster: Trying to register BlockManager
 15/07/08 11:28:41 INFO BlockManagerMasterEndpoint: Registering block manager 
 localhost:52800 with 265.4 MB RAM, BlockManagerId(drive
 r, localhost, 52800)
 15/07/08 11:28:41 INFO BlockManagerMaster: Registered BlockManager
 
 15/07/08 11:28:41 INFO SparkILoop: Created spark context..
 Spark context available as sc.
 15/07/08 11:28:41 INFO HiveContext: Initializing execution hive, version 
 0.13.1
 15/07/08 11:28:42 INFO HiveMetaStore: 0: Opening raw store with implemenation 
 class:org.apache.hadoop.hive.metastore.ObjectStore
 15/07/08 11:28:42

Re: Is there programmatic way running Spark job on Yarn cluster without using spark-submit script ?

2015-06-17 Thread Guru Medasani
Hi Elkhan,

There are a couple of ways to do this.

1) Spark-jobserver is a popular web server that is used to submit Spark jobs.

https://github.com/spark-jobserver/spark-jobserver

2) The spark-submit script sets up the classpath for the job. Bypassing the 
spark-submit script means you have to manage some of this work in your program 
itself.

Here is a link with some discussions around how to handle this scenario. 

http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/What-dependencies-to-submit-Spark-jobs-programmatically-not-via/td-p/24721


Guru Medasani
gdm...@gmail.com



 On Jun 17, 2015, at 6:01 PM, Elkhan Dadashov elkhan8...@gmail.com wrote:
 
 This is not independent programmatic way of running of Spark job on Yarn 
 cluster.
 
 That example demonstrates running on Yarn-client mode, also will be dependent 
 of Jetty. Users writing Spark programs do not want to depend on that.
 
 I found this SparkLauncher class introduced in Spark 1.4 version 
 (https://github.com/apache/spark/tree/master/launcher 
 https://github.com/apache/spark/tree/master/launcher) which allows running 
 Spark jobs in programmatic way. 
 
 SparkLauncher exists in Java and Scala APIs, but I could not find in Python 
 API.
 
 Did not try it yet, but seems promising.
 
 Example:
 
 import org.apache.spark.launcher.SparkLauncher;
 
 public class MyLauncher {
   public static void main(String[] args) throws Exception {
     Process spark = new SparkLauncher()
         .setAppResource("/my/app.jar")
         .setMainClass("my.spark.app.Main")
         .setMaster("local")
         .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
         .launch();
     spark.waitFor();
   }
 }
 
 
 
 On Wed, Jun 17, 2015 at 5:51 PM, Corey Nolet cjno...@gmail.com 
 mailto:cjno...@gmail.com wrote:
 An example of being able to do this is provided in the Spark Jetty Server 
 project [1] 
 
 [1] https://github.com/calrissian/spark-jetty-server 
 https://github.com/calrissian/spark-jetty-server
 
 On Wed, Jun 17, 2015 at 8:29 PM, Elkhan Dadashov elkhan8...@gmail.com 
 mailto:elkhan8...@gmail.com wrote:
 Hi all,
 
 Is there any way running Spark job in programmatic way on Yarn cluster 
 without using spark-submit script ?
 
 I cannot include Spark jars in my Java application (due to dependency conflict 
 and other reasons), so I'll be shipping Spark assembly uber jar 
 (spark-assembly-1.3.1-hadoop2.3.0.jar) to Yarn cluster, and then execute job 
 (Python or Java) on Yarn-cluster.
 
 So is there any way running Spark job implemented in python file/Java class 
 without calling it through spark-submit script ?
 
 Thanks.
 
 
 
 
 
 
 -- 
 
 Best regards,
 Elkhan Dadashov



Re: SparkR 1.4.0: read.df() function fails

2015-06-16 Thread Guru Medasani
Hi Esten,

Looks like your sqlContext is connected to a Hadoop/Spark cluster, but the file 
path you specified is a local one.

mydf <- read.df(sqlContext, "/home/esten/ami/usaf.json", source="json",

The error below shows that the input path you specified does not exist on the 
cluster. Pointing read.df() at the right HDFS path should help here. 

 Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does
 not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json

Guru Medasani
gdm...@gmail.com



 On Jun 16, 2015, at 10:39 AM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:
 
 The error you are running into is that the input file does not exist -- You 
 can see it from the following line
 Input path does not exist: 
 hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json
 
 Thanks
 Shivaram
 
 On Tue, Jun 16, 2015 at 1:55 AM, esten erik.stens...@dnvgl.com 
 mailto:erik.stens...@dnvgl.com wrote:
 Hi,
 In SparkR shell, I invoke:
  mydf-read.df(sqlContext, /home/esten/ami/usaf.json, source=json,
  header=false)
 I have tried various filetypes (csv, txt), all fail.
 
 RESPONSE: ERROR RBackendHandler: load on 1 failed
 BELOW THE WHOLE RESPONSE:
 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with
 curMem=0, maxMem=278302556
 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in
 memory (estimated size 173.4 KB, free 265.2 MB)
 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with
 curMem=177600, maxMem=278302556
 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes
 in memory (estimated size 16.2 KB, free 265.2 MB)
 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
 on localhost:37142 (size: 16.2 KB, free: 265.4 MB)
 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at
 NativeMethodAccessorImpl.java:-2
 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads
 feature cannot be used because libhadoop cannot be loaded.
 15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed
 java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
 at
 org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74)
 at
 org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36)
 at
 io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
 at
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
 at
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
 at
 io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
 at
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
 at
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
 at
 io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
 at
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
 at
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
 at
 io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
 at
 io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
 at
 io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
 at
 io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
 at
 io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
 at
 io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
 at
 io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does
 not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json
 at
 org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
 at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228

Re: Spark 1.4 release date

2015-06-12 Thread Guru Medasani
Here is a Spark 1.4 release blog by Databricks.

https://databricks.com/blog/2015/06/11/announcing-apache-spark-1-4.html


Guru Medasani
gdm...@gmail.com



 On Jun 12, 2015, at 7:08 AM, ayan guha guha.a...@gmail.com wrote:
 
 Thanks a lot.
 
 On 12 Jun 2015 19:46, Todd Nist tsind...@gmail.com 
 mailto:tsind...@gmail.com wrote:
 It was released yesterday.
 
 On Friday, June 12, 2015, ayan guha guha.a...@gmail.com 
 mailto:guha.a...@gmail.com wrote:
 Hi
 
 When is official spark 1.4 release date?
 Best
 Ayan
 



Re: Nightly builds/releases?

2015-05-04 Thread Guru Medasani
I see a JIRA for this one, but it is unresolved. 

https://issues.apache.org/jira/browse/SPARK-1517




 On May 4, 2015, at 10:25 PM, Ankur Chauhan achau...@brightcove.com wrote:
 
 Hi,
 
 Does anyone know if spark has any nightly builds or equivalent that provides 
 binaries that have passed a CI build so that one could try out the bleeding 
 edge without having to compile.
 
 -- Ankur



Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-28 Thread Guru Medasani
Hi Antony, did you get past this error by repartitioning your job into smaller 
tasks, as Sven Krasser pointed out?
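(For anyone else hitting this: smaller tasks here generally means more partitions 
on the ratings RDD and/or more ALS blocks; a rough PySpark sketch with made-up 
numbers follows.)

    from pyspark.mllib.recommendation import ALS

    # raw_ratings is assumed to be an existing RDD of Rating objects.
    ratings = raw_ratings.repartition(400)      # more, smaller partitions
    model = ALS.trainImplicit(ratings, rank=50, iterations=10,
                              lambda_=0.01, blocks=400, alpha=0.01)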

From:  Antony Mayi antonym...@yahoo.com
Reply-To:  Antony Mayi antonym...@yahoo.com
Date:  Tuesday, January 27, 2015 at 5:24 PM
To:  Guru Medasani gdm...@outlook.com, Sven Krasser kras...@gmail.com
Cc:  Sandy Ryza sandy.r...@cloudera.com, user@spark.apache.org 
user@spark.apache.org
Subject:  Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

I have yarn configured with yarn.nodemanager.vmem-check-enabled=false and 
yarn.nodemanager.pmem-check-enabled=false to avoid yarn killing the 
containers.

the stack trace is bellow.

thanks,
Antony.

15/01/27 17:02:53 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED 
SIGNAL 15: SIGTERM
15/01/27 17:02:53 ERROR executor.Executor: Exception in task 21.0 in stage 
12.0 (TID 1312)
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Integer.valueOf(Integer.java:642)
at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70)
at 
scala.collection.mutable.ArrayOps$ofInt.apply(ArrayOps.scala:156)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scal
a:33)
at 
scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:156)
at scala.collection.SeqLike$class.distinct(SeqLike.scala:493)
at 
scala.collection.mutable.ArrayOps$ofInt.distinct(ArrayOps.scala:156)
at 
org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommenda
tion$ALS$$makeOutLinkBlock(ALS.scala:404)
at 
org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:459)
at 
org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:456)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at 
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.sca
la:130)
at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.sca
la:127)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(Traver
sableLike.scala:772)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scal
a:33)
at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:7
71)
at 
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:127)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:
31)
15/01/27 17:02:53 ERROR util.SparkUncaughtExceptionHandler: Uncaught 
exception in thread Thread[Executor task launch worker-8,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Integer.valueOf(Integer.java:642)
at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70)
at 
scala.collection.mutable.ArrayOps$ofInt.apply(ArrayOps.scala:156)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scal
a:33)
at 
scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:156)
at scala.collection.SeqLike$class.distinct(SeqLike.scala:493)
at 
scala.collection.mutable.ArrayOps$ofInt.distinct(ArrayOps.scala:156)
at 
org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommenda
tion$ALS$$makeOutLinkBlock(ALS.scala:404)
at 
org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:459)
at 
org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:456)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at 
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-27 Thread Guru Medasani
Hi Antony,

What is the setting of the total amount of memory in MB that can be 
allocated to containers on your NodeManagers?

yarn.nodemanager.resource.memory-mb

Can you check the above configuration in the yarn-site.xml used by the 
NodeManager process?
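For reference, the property in yarn-site.xml looks like this (the value shown is 
just an example):

    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <!-- example: allow 32 GB of container memory per NodeManager -->
      <value>32768</value>
    </property>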

-Guru Medasani

From:  Sandy Ryza sandy.r...@cloudera.com
Date:  Tuesday, January 27, 2015 at 3:33 PM
To:  Antony Mayi antonym...@yahoo.com
Cc:  user@spark.apache.org user@spark.apache.org
Subject:  Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

Hi Antony,

If you look in the YARN NodeManager logs, do you see that it's killing the 
executors?  Or are they crashing for a different reason?

-Sandy

On Tue, Jan 27, 2015 at 12:43 PM, Antony Mayi 
antonym...@yahoo.com.invalid wrote:
Hi,

I am using spark.yarn.executor.memoryOverhead=8192 yet getting executors 
crashed with this error.

does that mean I have genuinely not enough RAM or is this matter of config 
tuning?

other config options used:
spark.storage.memoryFraction=0.3
SPARK_EXECUTOR_MEMORY=14G

running spark 1.2.0 as yarn-client on cluster of 10 nodes (the workload is 
ALS trainImplicit on ~15GB dataset)

thanks for any ideas,
Antony.




Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-27 Thread Guru Medasani
Can you attach the logs where this is failing?

From:  Sven Krasser kras...@gmail.com
Date:  Tuesday, January 27, 2015 at 4:50 PM
To:  Guru Medasani gdm...@outlook.com
Cc:  Sandy Ryza sandy.r...@cloudera.com, Antony Mayi 
antonym...@yahoo.com, user@spark.apache.org user@spark.apache.org
Subject:  Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

Since it's an executor running OOM it doesn't look like a container being 
killed by YARN to me. As a starting point, can you repartition your job 
into smaller tasks?
-Sven

On Tue, Jan 27, 2015 at 2:34 PM, Guru Medasani gdm...@outlook.com wrote:
Hi Antony,

What is the setting of the total amount of memory in MB that can be 
allocated to containers on your NodeManagers?

yarn.nodemanager.resource.memory-mb

Can you check this above configuration in yarn-site.xml used by the node 
manager process?

-Guru Medasani

From:  Sandy Ryza sandy.r...@cloudera.com
Date:  Tuesday, January 27, 2015 at 3:33 PM
To:  Antony Mayi antonym...@yahoo.com
Cc:  user@spark.apache.org user@spark.apache.org
Subject:  Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

Hi Antony,

If you look in the YARN NodeManager logs, do you see that it's killing the 
executors?  Or are they crashing for a different reason?

-Sandy

On Tue, Jan 27, 2015 at 12:43 PM, Antony Mayi 
antonym...@yahoo.com.invalid wrote:
Hi,

I am using spark.yarn.executor.memoryOverhead=8192 yet getting executors 
crashed with this error.

does that mean I have genuinely not enough RAM or is this matter of config 
tuning?

other config options used:
spark.storage.memoryFraction=0.3
SPARK_EXECUTOR_MEMORY=14G

running spark 1.2.0 as yarn-client on cluster of 10 nodes (the workload is 
ALS trainImplicit on ~15GB dataset)

thanks for any ideas,
Antony.




-- 
http://sites.google.com/site/krasser/?utm_source=sig



RE: Spark Installation Maven PermGen OutOfMemoryException

2014-12-23 Thread Guru Medasani
Hi Vladimir,
From the link Sean posted, if you use Java 8 there is the following note:
Note: For Java 8 and above this step is not required.

So if you have no problems using Java 8, give it a shot. 
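(If you stay on Java 7, the setting from that page is along these lines:)

    export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"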

Best Regards,
Guru Medasani




 From: so...@cloudera.com
 Date: Tue, 23 Dec 2014 15:04:42 +
 Subject: Re: Spark Installation Maven PermGen OutOfMemoryException
 To: protsenk...@gmail.com
 CC: user@spark.apache.org
 
 You might try a little more. The official guidance suggests 2GB:
 
 https://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage
 
 
 On Tue, Dec 23, 2014 at 2:57 PM, Vladimir Protsenko
 protsenk...@gmail.com wrote:
  I am installing Spark 1.2.0 on CentOS 6.6. Just downloaded code from github,
  installed maven and trying to compile system:
 
  git clone https://github.com/apache/spark.git
  git checkout v1.2.0
  mvn -DskipTests clean package
 
  leads to OutOfMemoryException. What amount of memory does it requires?
 
  export MAVEN_OPTS=`-Xmx=1500m -XX:MaxPermSize=512m
  -XX:ReservedCodeCacheSize=512m` doesn't help.
 
  Waht is a straight forward way to start using Spark?
 
 
 
  --
  View this message in context: 
  http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Installation-Maven-PermGen-OutOfMemoryException-tp20831.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
  

RE: Spark Installation Maven PermGen OutOfMemoryException

2014-12-23 Thread Guru Medasani
Thanks for the clarification Sean. 

Best Regards,
Guru Medasani




 From: so...@cloudera.com
 Date: Tue, 23 Dec 2014 15:39:59 +
 Subject: Re: Spark Installation Maven PermGen OutOfMemoryException
 To: gdm...@outlook.com
 CC: protsenk...@gmail.com; user@spark.apache.org
 
 The text there is actually unclear. In Java 8, you still need to set
 the max heap size (-Xmx2g). The optional bit is the
 -XX:MaxPermSize=512M actually. Java 8 no longer has a separate
 permanent generation.
 
 On Tue, Dec 23, 2014 at 3:32 PM, Guru Medasani gdm...@outlook.com wrote:
  Hi Vladimir,
 
  From the link Sean posted, if you use Java 8 there is this following note.
 
  Note: For Java 8 and above this step is not required.
 
  So if you have no problems using Java 8, give it a shot.
 
  Best Regards,
  Guru Medasani
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
  

Re: Programatically running of the Spark Jobs.

2014-09-04 Thread Guru Medasani
I am able to run Spark jobs and Spark Streaming jobs successfully via YARN on a 
CDH cluster. 

When you say YARN isn't quite there yet, do you mean for submitting the jobs 
programmatically, or just in general?
 

On Sep 4, 2014, at 1:45 AM, Matt Chu m...@kabam.com wrote:

 https://github.com/spark-jobserver/spark-jobserver
 
 Ooyala's Spark jobserver is the current de facto standard, IIUC. I just added 
 it to our prototype stack, and will begin trying it out soon. Note that you 
 can only do standalone or Mesos; YARN isn't quite there yet.
 
 (The repo just moved from https://github.com/ooyala/spark-jobserver, so don't 
 trust Google on this one (yet); development is happening in the first repo.)
 
 
 
 On Wed, Sep 3, 2014 at 11:39 PM, Vicky Kak vicky@gmail.com wrote:
 I have been able to submit the spark jobs using the submit script but I would 
 like to do it via code.
 I am unable to search anything matching to my need.
 I am thinking of using org.apache.spark.deploy.SparkSubmit to do so, may be 
 have to write some utility that passes the parameters required for this class.
 I would be interested to know how community is doing.
 
 Thanks,
 Vicky
 



Re: Spark-submit not running

2014-08-28 Thread Guru Medasani
Can you copy the exact spark-submit command that you are running?

You should be able to run it locally without installing Hadoop.

Here is an example of how to run the job locally.

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100


On Aug 28, 2014, at 7:54 AM, Hingorani, Vineet vineet.hingor...@sap.com wrote:

 How can I set HADOOP_HOME if I am running the Spark on my local machine 
 without anything else? Do I have to install some other pre-built file? I am 
 on Windows 7 and Spark’s official site says that it is available on Windows, 
 I added Java path in the PATH variable.
  
 Vineet
  
 From: Sean Owen [mailto:so...@cloudera.com] 
 Sent: Donnerstag, 28. August 2014 13:49
 To: Hingorani, Vineet
 Cc: user@spark.apache.org
 Subject: Re: Spark-submit not running
  
 You need to set HADOOP_HOME. Is Spark officially supposed to work on Windows 
 or not at this stage? I know the build doesn't quite yet.
  
 
 On Thu, Aug 28, 2014 at 11:37 AM, Hingorani, Vineet 
 vineet.hingor...@sap.com wrote:
 The file is compiling properly but when I try to run the jar file using 
 spark-submit, it is giving some errors. I am running spark locally and have 
 downloaded a pre-built version of Spark named “For Hadoop 2 (HDP2, CDH5)”.AI 
 don’t know if it is a dependency problem but I don’t want to have Hadoop in 
 my system. The error says:
  
 14/08/28 12:34:36 ERROR util.Shell: Failed to locate the winutils binary in 
 the hadoop binary path
 java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
 Hadoop binaries.
  
 Vineet



Re: Spark-submit not running

2014-08-28 Thread Guru Medasani
Thanks Sean.

Looks like there is a workaround as per the JIRA 
https://issues.apache.org/jira/browse/SPARK-2356.

http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7.

Maybe that's worth a shot?

On Aug 28, 2014, at 8:15 AM, Sean Owen so...@cloudera.com wrote:

 Yes, but I think at the moment there is still a dependency on Hadoop even 
 when not using it. See https://issues.apache.org/jira/browse/SPARK-2356  
 
 
 On Thu, Aug 28, 2014 at 2:14 PM, Guru Medasani gdm...@outlook.com wrote:
 Can you copy the exact spark-submit command that you are running?
 
 You should be able to run it locally without installing hadoop. 
 
 Here is an example on how to run the job locally.
 
 
 # Run application locally on 8 cores
 ./bin/spark-submit \
   --class org.apache.spark.examples.SparkPi \
   --master local[8] \
   /path/to/examples.jar \
   100