Re: Spark Streaming: How to specify deploy mode through configuration parameter?

2015-12-16 Thread Ted Yu
Since both Scala and Java files are involved in the PR, I don't see an easy
way around this without building Spark yourself.
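One thing that might be worth trying in the meantime is the programmatic
SparkLauncher API, which does accept a deploy mode. This is only an untested
sketch on my side (the paths, jar and class name are placeholders, and $host
is the variable from your foreach), and note that it still drives the
spark-submit machinery under the hood, so it may not help if you need to
avoid spark-submit entirely:

import org.apache.spark.launcher.SparkLauncher

val process = new SparkLauncher()
  .setSparkHome("/path/to/spark")            // placeholder
  .setAppResource("/path/to/your-app.jar")   // placeholder: jar containing the driver code
  .setMainClass("com.example.YourApp")       // placeholder
  .setMaster(s"spark://$host:7077")
  .setDeployMode("cluster")
  .setConf("spark.driver.memory", "4g")
  .setConf("spark.executor.memory", "4g")
  .launch()                                  // returns a java.lang.Process
process.waitFor()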

Cheers

On Wed, Dec 16, 2015 at 10:18 AM, Saiph Kappa  wrote:

> Exactly, but it's only fixed for the next spark version. Is there any work
> around for version 1.5.2?
>
> On Wed, Dec 16, 2015 at 4:36 PM, Ted Yu  wrote:
>
>> This seems related:
>> [SPARK-10123][DEPLOY] Support specifying deploy mode from configuration
>>
>> FYI
>>
>> On Wed, Dec 16, 2015 at 7:31 AM, Saiph Kappa 
>> wrote:
>>
>>> Hi,
>>>
>>> I have a client application running on host0 that is launching multiple
>>> drivers on multiple remote standalone spark clusters (each cluster is
>>> running on a single machine):
>>>
>>> «
>>> ...
>>>
>>> List("host1", "host2" , "host3").foreach(host => {
>>>
>>> val sparkConf = new SparkConf()
>>> sparkConf.setAppName("App")
>>>
>>> sparkConf.set("spark.driver.memory", "4g")
>>> sparkConf.set("spark.executor.memory", "4g")
>>> sparkConf.set("spark.driver.maxResultSize", "4g")
>>> sparkConf.set("spark.serializer", 
>>> "org.apache.spark.serializer.KryoSerializer")
>>> sparkConf.set("spark.executor.extraJavaOptions", " -XX:+UseCompressedOops 
>>> -XX:+UseConcMarkSweepGC " +
>>>   "-XX:+AggressiveOpts -XX:FreqInlineSize=300 -XX:MaxInlineSize=300 ")
>>>
>>> sparkConf.setMaster(s"spark://$host:7077")
>>>
>>> val rawStreams = (1 to source.parallelism).map(_ => 
>>> ssc.textFileStream("/home/user/data/")).toArray
>>> val rawStream = ssc.union(rawStreams)
>>> rawStream.count.map(c => s"Received $c records.").print()
>>>
>>> })
>>> ...
>>>
>>> »
>>>
>>> The problem is that I'm getting an error message saying that the directory 
>>> "/home/user/data/" does not exist.
>>> In fact, this directory only exists in host1, host2 and host3 and not in 
>>> host0.
>>> But since I'm launching the driver to host1..3 I thought data would be 
>>> fetched from those machines.
>>>
>>> I'm also trying to avoid using the spark submit script, and couldn't find 
>>> the configuration parameter to specify the deploy mode.
>>>
>>> Is there any way to specify the deploy mode through configuration parameter?
>>>
>>>
>>> Thanks.
>>>
>>>
>>
>


Re: looking for a easier way to count the number of items in a JavaDStream

2015-12-16 Thread Bryan Cutler
To follow up on your other issue: if you are just trying to count
elements in a DStream, you can do that without an Accumulator.  foreachRDD
is meant to be an output action; it does not return anything, and the
function you pass it actually runs in the driver program.  Because Java
(before 8) handles closures a little differently, it might be easiest to
implement the function to pass to foreachRDD as something like this:

class MyFunc implements VoidFunction {

  public long total = 0;

  @Override
  public void call(JavaRDD rdd) {
System.out.println("foo " + rdd.collect().toString());
total += rdd.count();
  }
}

MyFunc f = new MyFunc();

inputStream.foreachRDD(f);

// f.total will have the count of all RDDs
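Just to be explicit, f.total only fills in as each batch completes, so in a
test you would read it after starting the context and letting it run for a
bit, roughly (timeoutMillis is a placeholder):

ssc.start()
ssc.awaitTerminationOrTimeout(timeoutMillis)
// now assert on f.total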

Hope that helps some!

-bryan

On Wed, Dec 16, 2015 at 8:37 AM, Bryan Cutler  wrote:

> Hi Andy,
>
> Regarding the foreachrdd return value, this Jira that will be in 1.6
> should take care of that https://issues.apache.org/jira/browse/SPARK-4557
> and make things a little simpler.
> On Dec 15, 2015 6:55 PM, "Andy Davidson" 
> wrote:
>
>> I am writing  a JUnit test for some simple streaming code. I want to
>> make assertions about how many things are in a given JavaDStream. I wonder
>> if there is an easier way in Java to get the count?
>>
>> I think there are two points of friction.
>>
>>
>>1. is it easy to create an accumulator of type double or int, How
>>ever Long is not supported
>>2. We need to use javaDStream.foreachRDD. The Function interface must
>>return void. I was not able to define an accumulator in my driver and
>>use a lambda function. (I am new to lambda in Java)
>>
>> Here is a little lambda example that logs my test objects. I was not
>> able to figure out how to get  to return a value or access a accumulator
>>
>>data.foreachRDD(rdd -> {
>>
>> logger.info(“Begin data.foreachRDD" );
>>
>> for (MyPojo pojo : rdd.collect()) {
>>
>> logger.info("\n{}", pojo.toString());
>>
>> }
>>
>> return null;
>>
>> });
>>
>>
>> Any suggestions would be greatly appreciated
>>
>> Andy
>>
>> This following code works in my driver but is a lot of code for such a
>> trivial computation. Because it needs to the JavaSparkContext I do not
>> think it would work inside a closure. I assume the works do not have access
>> to the context as a global and that it shipping it in the closure is not a
>> good idea?
>>
>> public class JavaDStreamCount implements Serializable {
>>
>> private static final long serialVersionUID = -3600586183332429887L;
>>
>> public static Logger logger =
>> LoggerFactory.getLogger(JavaDStreamCount.class);
>>
>>
>>
>> public Double hack(JavaSparkContext sc, JavaDStream javaDStream) {
>>
>> Count c = new Count(sc);
>>
>> javaDStream.foreachRDD(c);
>>
>> return c.getTotal().value();
>>
>> }
>>
>>
>>
>> class Count implements Function {
>>
>> private static final long serialVersionUID =
>> -5239727633710162488L;
>>
>> Accumulator total;
>>
>>
>>
>> public Count(JavaSparkContext sc) {
>>
>> total = sc.accumulator(0.0);
>>
>> }
>>
>>
>>
>> @Override
>>
>> public java.lang.Void call(JavaRDD rdd) throws Exception {
>>
>> List data = rdd.collect();
>>
>> int dataSize = data.size();
>>
>> logger.error("data.size:{}", dataSize);
>>
>> long num = rdd.count();
>>
>> logger.error("num:{}", num);
>>
>> total.add(new Double(num));
>>
>> return null;
>>
>> }
>>
>>
>> public Accumulator getTotal() {
>>
>> return total;
>>
>> }
>>
>> }
>>
>> }
>>
>>
>>
>>
>>


Scala VS Java VS Python

2015-12-16 Thread Daniel Valdivia
Hello,

This is more of a "survey" question for the community; you can reply to me
directly so we don't flood the mailing list.

I'm having a hard time learning Spark using Python since the API seems to be
slightly incomplete, so I'm looking at my options for doing all my apps in
either Scala or Java. Being a Java developer, Java 1.8 looks like the logical
way to go; however, I'd like to ask here which is the most common (Scala or
Java), since I'm seeing mixed results in the community documentation, although
Scala seems to be the predominant language for Spark examples.

Thanks for the advice
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Parquet datasource optimization for distinct query

2015-12-16 Thread pnpritchard
I have a parquet file that is partitioned by a column, like shown in
http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery.
I like this storage technique because the datasource can "push-down" filters
on the partitioned column, making some queries a lot faster.

However, there is additional "push-down" optimization that could be done,
specifically on distinct queries. For example, a distinct on the partitioned
column only requires scanning the directory structure; no files need to be
read. This can make a huge difference when there is a lot of data.

This kind of optimization doesn't seem possible to implement currently
because the ParquetRelation only receives an array of Filter, and not the
full logical plan (which would include the distinct expression). Are there
any plans to change this? Has anyone else encountered this and figured out a
workaround?
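The closest thing to a workaround I have come up with is to bypass the
datasource and read the partition values straight out of the directory names
with the Hadoop FileSystem API, roughly like this (untested sketch; the path
and the "column=value" layout are placeholders for my actual table):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val distinctValues = fs.listStatus(new Path("/path/to/table"))  // placeholder path
  .filter(_.isDirectory)
  .map(_.getPath.getName)          // directory names look like "date=2015-12-16"
  .filter(_.contains("="))
  .map(_.split("=", 2)(1))         // keep just the partition value

But that obviously loses the generality of simply running DISTINCT against the
DataFrame.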

Thanks!
Nick



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-datasource-optimization-for-distinct-query-tp25721.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



count(*) performance in Hive vs Spark DataFrames

2015-12-16 Thread Christopher Brady
I'm having an issue where count(*) returns almost immediately using 
Hive, but takes over 10 min using DataFrames. The table data is on HDFS 
in an uncompressed CSV format. How is it possible for Hive to get the 
count so fast? Is it caching this or putting it in the metastore?


Is there anything I can do to optimize the performance of this using 
DataFrames, or should I try doing just the count with Hive using JDBC?


I've tried writing this 2 ways:

try (final JavaSparkContext sc = new JavaSparkContext("yarn-cluster", 
"Test app")) {

final HiveContext sqlContext = new HiveContext(sc.sc());
DataFrame df = sqlContext.sql("SELECT count(*) FROM my_table");
df.collect();
}

try (final JavaSparkContext sc = new JavaSparkContext("yarn-cluster", 
"Test app")) {

final HiveContext sqlContext = new HiveContext(sc.sc());
DataFrame df = sqlContext.table("my_table");
df.count();
}
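My working guess is that Hive may be answering count(*) from table statistics
in the metastore rather than scanning the files, but I haven't verified that.
One way to check, from the Hive side, would be something like:

DESCRIBE FORMATTED my_table;                -- look for numRows under Table Parameters
ANALYZE TABLE my_table COMPUTE STATISTICS;  -- (re)computes those statistics

If numRows is populated and hive.compute.query.using.stats is enabled, Hive
can answer the count without reading the data.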

Thanks.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Scala VS Java VS Python

2015-12-16 Thread Daniel Lopes
For me Scala is better since Spark is written in Scala, and I like Python
because I've always used Python for data science. :)

On Wed, Dec 16, 2015 at 5:54 PM, Daniel Valdivia 
wrote:

> Hello,
>
> This is more of a "survey" question for the community, you can reply to me
> directly so we don't flood the mailing list.
>
> I'm having a hard time learning Spark using Python since the API seems to
> be slightly incomplete, so I'm looking at my options to start doing all my
> apps in either Scala or Java, being a Java Developer, java 1.8 looks like
> the logical way, however I'd like to ask here what's the most common (Scala
> Or Java) since I'm observing mixed results in the social documentation,
> however Scala seems to be the predominant language for spark examples.
>
> Thank for the advice
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
*Daniel Lopes, B.Eng*
Data Scientist - BankFacil
CREA/SP 5069410560

Mob +55 (18) 99764-2733 
Ph +55 (11) 3522-8009
http://about.me/dannyeuu

Av. Nova Independência, 956, São Paulo, SP
Bairro Brooklin Paulista
CEP 04570-001
https://www.bankfacil.com.br


Re: File not found error running query in spark-shell

2015-12-16 Thread Jakob Odersky
When you re-run the last statement a second time, does it work? Could it be
related to https://issues.apache.org/jira/browse/SPARK-12350 ?

On 16 December 2015 at 10:39, Ted Yu  wrote:

> Hi,
> I used the following command on a recently refreshed checkout of master
> branch:
>
> ~/apache-maven-3.3.3/bin/mvn -Phive -Phive-thriftserver -Pyarn
> -Phadoop-2.4 -Dhadoop.version=2.7.0 package -DskipTests
>
> I was then running simple query in spark-shell:
> Seq(
>   (83, 0, 38),
>   (26, 0, 79),
>   (43, 81, 24)
> ).toDF("a", "b", "c").registerTempTable("cachedData")
>
> sqlContext.cacheTable("cachedData")
> sqlContext.sql("select * from cachedData").show
>
> However, I encountered errors in the following form:
>
> http://pastebin.com/QeiwJpwi
>
> Under workspace, I found:
>
> ./sql/catalyst/target/scala-2.10/classes/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedClass.class
>
> but no ByteOrder.class.
>
> Did I miss some step(s) ?
>
> Thanks
>


Re: File not found error running query in spark-shell

2015-12-16 Thread Jakob Odersky
Yeah, the same kind of error actually happens in the JIRA. It actually
succeeds but a load of exceptions are thrown. Subsequent runs don't produce
any errors anymore.

On 16 December 2015 at 10:55, Ted Yu  wrote:

> The first run actually worked. It was the amount of exceptions preceding
> the result that surprised me.
>
> I want to see if there is a way of getting rid of the exceptions.
>
> Thanks
>
> On Wed, Dec 16, 2015 at 10:53 AM, Jakob Odersky 
> wrote:
>
>> When you re-run the last statement a second time, does it work? Could it
>> be related to https://issues.apache.org/jira/browse/SPARK-12350 ?
>>
>> On 16 December 2015 at 10:39, Ted Yu  wrote:
>>
>>> Hi,
>>> I used the following command on a recently refreshed checkout of master
>>> branch:
>>>
>>> ~/apache-maven-3.3.3/bin/mvn -Phive -Phive-thriftserver -Pyarn
>>> -Phadoop-2.4 -Dhadoop.version=2.7.0 package -DskipTests
>>>
>>> I was then running simple query in spark-shell:
>>> Seq(
>>>   (83, 0, 38),
>>>   (26, 0, 79),
>>>   (43, 81, 24)
>>> ).toDF("a", "b", "c").registerTempTable("cachedData")
>>>
>>> sqlContext.cacheTable("cachedData")
>>> sqlContext.sql("select * from cachedData").show
>>>
>>> However, I encountered errors in the following form:
>>>
>>> http://pastebin.com/QeiwJpwi
>>>
>>> Under workspace, I found:
>>>
>>> ./sql/catalyst/target/scala-2.10/classes/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedClass.class
>>>
>>> but no ByteOrder.class.
>>>
>>> Did I miss some step(s) ?
>>>
>>> Thanks
>>>
>>
>>
>


Re: Can't create UDF through thriftserver, no error reported

2015-12-16 Thread Antonio Piccolboni
Mmmmh, I may have been stuck with some stale lib. A complete reset of the
client (where by "complete" I mean beyond what reason would require) solved
the problem. I think we can consider this solved unless new evidence
appears. Thanks!

On Wed, Dec 16, 2015 at 9:16 AM Antonio Piccolboni 
wrote:

> Can repro only in custom client, not beeline. Let me dig deeper.
>
> On Wed, Dec 16, 2015 at 9:03 AM Antonio Piccolboni <
> anto...@piccolboni.info> wrote:
>
>> Hi Jeff,
>> the ticket is certainly relevant, thanks for digging it out, but as I
>> said I can repro in 1.6.0-rc2. Will try again just to make sure.
>>
>> On Tue, Dec 15, 2015 at 5:17 PM Jeff Zhang  wrote:
>>
>>> It should be resolved by this ticket
>>> https://issues.apache.org/jira/browse/SPARK-11191
>>>
>>>
>>>
>>> On Wed, Dec 16, 2015 at 3:14 AM, Antonio Piccolboni <
>>> anto...@piccolboni.info> wrote:
>>>
 Hi,
 I am trying to create a UDF using the thiftserver. I followed this
 example , which is originally
 for hive. My understanding is that the thriftserver creates a hivecontext
 and Hive UDFs should be supported. I then sent this query to the
 thriftserver (I use the RJDBC module for R but I doubt any other JDBC
 client would be any different):


 CREATE TEMPORARY FUNCTION NVL2 AS 'khanolkar.HiveUDFs.NVL2GenericUDF'

 I only changed some name wrt  the posted examples, but I think the
 class was found just right because 1)There's no errors in the log or
 console 2)I can generate a class not found error mistyping the class name,
 and I see it in the logs 3) I can use the reflect builtin to invoke a
 different function that I wrote and supplied to spark in the same way
 (--jars option to start-thriftserver)

 After this, I can't use the NVL2 function in a query and I can't even
 do a  DESCRIBE query on it,  nor does it list with SHOW FUNCTIONS. I tried
 both 1.5.1 and 1.6.0-rc2 built with thriftserver support for Hadoop 2.6

 I know the HiveContext is slightly behind the latest Hive as far as
 features, I believe one or two revs, so that may be one potential problem,
 but all these feature I believe are present in Hive 0.11 and should have
 made it into Spark. At the very least, I would like to see some message in
 the logs and console so that I can find the error of my ways, repent and
 fix my code. Any suggestions? Anything I should post to support
 troubleshooting? Is this JIRA-worthy? Thanks

 Antonio




>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>


Setting the vote rate in a Random Forest in MLlib

2015-12-16 Thread Young, Matthew T
One of our data scientists is interested in using Spark to improve performance 
in some random forest binary classifications, but isn't getting good enough 
results from MLlib's implementation of the random forest compared to R's 
randomforest library with the available parameters. She suggested that if she 
could tune the vote rate of the forest (how many trees are required to vote 
true to cause a categorization) she might be able to reach the false positive 
and true positive targets for the project.

Is there any way to set the vote rate for a random forest in Spark 1.5.2? I
don't see any such option in the trainClassifier API.
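The only workaround I can think of is to skip the model's built-in majority
vote and threshold the per-tree votes myself, roughly like this (untested
sketch; it assumes RandomForestModel exposes its individual trees via
model.trees, which I believe it does):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel

def predictWithCutoff(model: RandomForestModel, features: Vector, cutoff: Double): Double = {
  // for a binary classifier each tree votes 0.0 or 1.0
  val positiveVotes = model.trees.count(_.predict(features) == 1.0)
  if (positiveVotes.toDouble / model.trees.length >= cutoff) 1.0 else 0.0
}

Is there a supported way to do this instead?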

Thanks,

-- Matthew


File not found error running query in spark-shell

2015-12-16 Thread Ted Yu
Hi,
I used the following command on a recently refreshed checkout of master
branch:

~/apache-maven-3.3.3/bin/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.4
-Dhadoop.version=2.7.0 package -DskipTests

I was then running simple query in spark-shell:
Seq(
  (83, 0, 38),
  (26, 0, 79),
  (43, 81, 24)
).toDF("a", "b", "c").registerTempTable("cachedData")

sqlContext.cacheTable("cachedData")
sqlContext.sql("select * from cachedData").show

However, I encountered errors in the following form:

http://pastebin.com/QeiwJpwi

Under workspace, I found:
./sql/catalyst/target/scala-2.10/classes/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedClass.class

but no ByteOrder.class.

Did I miss some step(s) ?

Thanks


Re: HiveContext Self join not reading from cache

2015-12-16 Thread Ted Yu
I did the following exercise in spark-shell ("c" is cached table):

scala> sqlContext.sql("select x.b from c x join c y on x.a = y.a").explain
== Physical Plan ==
Project [b#4]
+- BroadcastHashJoin [a#3], [a#125], BuildRight
   :- InMemoryColumnarTableScan [b#4,a#3], InMemoryRelation [a#3,b#4,c#5],
true, 1, StorageLevel(true, true, false, true, 1), ConvertToUnsafe,
Some(c)
   +- InMemoryColumnarTableScan [a#125], InMemoryRelation
[a#125,b#126,c#127], true, 1, StorageLevel(true, true, false, true, 1),
ConvertToUnsafe, Some(c)

sqlContext.sql("select x.b, y.c from c x join c y on x.a =
y.a").registerTempTable("d")
scala> sqlContext.cacheTable("d")

scala> sqlContext.sql("select x.b from d x join d y on x.c = y.c").explain
== Physical Plan ==
Project [b#4]
+- SortMergeJoin [c#90], [c#253]
   :- Sort [c#90 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(c#90,200), None
   : +- InMemoryColumnarTableScan [b#4,c#90], InMemoryRelation
[b#4,c#90], true, 1, StorageLevel(true, true, false, true, 1), Project
[b#4,c#90], Some(d)
   +- Sort [c#253 ASC], false, 0
  +- TungstenExchange hashpartitioning(c#253,200), None
 +- InMemoryColumnarTableScan [c#253], InMemoryRelation
[b#246,c#253], true, 1, StorageLevel(true, true, false, true, 1),
Project [b#4,c#90], Some(d)

Is the above what you observed ?
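For reference, a minimal way to reproduce your three-table case and check the
plan might be something like this (untested sketch; table and column names are
placeholders):

val tableC = sqlContext.table("TableA").join(sqlContext.table("TableB"), "key")
tableC.cache()
tableC.registerTempTable("TableC")
sqlContext.sql("select * from TableC x join TableC y on x.key = y.key").explain
// if this shows scans of TableA / TableB instead of an InMemoryColumnarTableScan
// over the cached TableC, the cache is not being reused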

Cheers

On Wed, Dec 16, 2015 at 9:34 AM, Gourav Sengupta 
wrote:

> Hi,
>
> This is how the data  can be created:
>
> 1. TableA : cached()
> 2. TableB : cached()
> 3. TableC: TableA inner join TableB cached()
> 4. TableC join TableC does not take the data from cache but starts reading
> the data for TableA and TableB from disk.
>
> Does this sound like a bug? The self join between TableA and TableB are
> working fine and taking data from cache.
>
>
> Regards,
> Gourav
>


Re: Can't create UDF through thriftserver, no error reported

2015-12-16 Thread Antonio Piccolboni
Hi Jeff,
the ticket is certainly relevant, thanks for digging it out, but as I said
I can repro in 1.6.0-rc2. Will try again just to make sure.

On Tue, Dec 15, 2015 at 5:17 PM Jeff Zhang  wrote:

> It should be resolved by this ticket
> https://issues.apache.org/jira/browse/SPARK-11191
>
>
>
> On Wed, Dec 16, 2015 at 3:14 AM, Antonio Piccolboni <
> anto...@piccolboni.info> wrote:
>
>> Hi,
>> I am trying to create a UDF using the thiftserver. I followed this
>> example , which is originally
>> for hive. My understanding is that the thriftserver creates a hivecontext
>> and Hive UDFs should be supported. I then sent this query to the
>> thriftserver (I use the RJDBC module for R but I doubt any other JDBC
>> client would be any different):
>>
>>
>> CREATE TEMPORARY FUNCTION NVL2 AS 'khanolkar.HiveUDFs.NVL2GenericUDF'
>>
>> I only changed some name wrt  the posted examples, but I think the class
>> was found just right because 1)There's no errors in the log or console 2)I
>> can generate a class not found error mistyping the class name, and I see it
>> in the logs 3) I can use the reflect builtin to invoke a different function
>> that I wrote and supplied to spark in the same way (--jars option to
>> start-thriftserver)
>>
>> After this, I can't use the NVL2 function in a query and I can't even do
>> a  DESCRIBE query on it,  nor does it list with SHOW FUNCTIONS. I tried
>> both 1.5.1 and 1.6.0-rc2 built with thriftserver support for Hadoop 2.6
>>
>> I know the HiveContext is slightly behind the latest Hive as far as
>> features, I believe one or two revs, so that may be one potential problem,
>> but all these feature I believe are present in Hive 0.11 and should have
>> made it into Spark. At the very least, I would like to see some message in
>> the logs and console so that I can find the error of my ways, repent and
>> fix my code. Any suggestions? Anything I should post to support
>> troubleshooting? Is this JIRA-worthy? Thanks
>>
>> Antonio
>>
>>
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: Error while running a job in yarn-client mode

2015-12-16 Thread Ted Yu
Do you mind sharing code snippet so that we have more clue ?

Thanks

On Wed, Dec 16, 2015 at 8:37 AM, sunil m <260885smanik...@gmail.com> wrote:

> Hello Spark experts!
>
>
> I am using spark 1.5.1 and get the following exception  while running
> sample applications
>
> Any tips/ hints on how to solve the error below will be of great help!
>
>
> _
> *Exception in thread "main" java.lang.IllegalStateException: Cannot call
> methods on a stopped SparkContext*
> at org.apache.spark.SparkContext.org
> $apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
> at
> org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:2052)
> at
> org.apache.spark.api.java.JavaSparkContext.parallelizePairs(JavaSparkContext.scala:169)
> at
> com.slb.sis.bolt.test.samples.SparkAPIExamples.mapsFromPairsToPairs(SparkAPIExamples.java:156)
> at
> com.slb.sis.bolt.test.samples.SparkAPIExamples.main(SparkAPIExamples.java:33)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> 
>
> Thanks in advance.
>
> Warm regards,
> Sunil M.
>


Re: Spark Streaming: How to specify deploy mode through configuration parameter?

2015-12-16 Thread Saiph Kappa
Exactly, but it's only fixed for the next spark version. Is there any work
around for version 1.5.2?

On Wed, Dec 16, 2015 at 4:36 PM, Ted Yu  wrote:

> This seems related:
> [SPARK-10123][DEPLOY] Support specifying deploy mode from configuration
>
> FYI
>
> On Wed, Dec 16, 2015 at 7:31 AM, Saiph Kappa 
> wrote:
>
>> Hi,
>>
>> I have a client application running on host0 that is launching multiple
>> drivers on multiple remote standalone spark clusters (each cluster is
>> running on a single machine):
>>
>> «
>> ...
>>
>> List("host1", "host2" , "host3").foreach(host => {
>>
>> val sparkConf = new SparkConf()
>> sparkConf.setAppName("App")
>>
>> sparkConf.set("spark.driver.memory", "4g")
>> sparkConf.set("spark.executor.memory", "4g")
>> sparkConf.set("spark.driver.maxResultSize", "4g")
>> sparkConf.set("spark.serializer", 
>> "org.apache.spark.serializer.KryoSerializer")
>> sparkConf.set("spark.executor.extraJavaOptions", " -XX:+UseCompressedOops 
>> -XX:+UseConcMarkSweepGC " +
>>   "-XX:+AggressiveOpts -XX:FreqInlineSize=300 -XX:MaxInlineSize=300 ")
>>
>> sparkConf.setMaster(s"spark://$host:7077")
>>
>> val rawStreams = (1 to source.parallelism).map(_ => 
>> ssc.textFileStream("/home/user/data/")).toArray
>> val rawStream = ssc.union(rawStreams)
>> rawStream.count.map(c => s"Received $c records.").print()
>>
>> })
>> ...
>>
>> »
>>
>> The problem is that I'm getting an error message saying that the directory 
>> "/home/user/data/" does not exist.
>> In fact, this directory only exists in host1, host2 and host3 and not in 
>> host0.
>> But since I'm launching the driver to host1..3 I thought data would be 
>> fetched from those machines.
>>
>> I'm also trying to avoid using the spark submit script, and couldn't find 
>> the configuration parameter to specify the deploy mode.
>>
>> Is there any way to specify the deploy mode through configuration parameter?
>>
>>
>> Thanks.
>>
>>
>


Re: File not found error running query in spark-shell

2015-12-16 Thread Ted Yu
The first run actually worked. It was the amount of exceptions preceding
the result that surprised me.

I want to see if there is a way of getting rid of the exceptions.

Thanks

On Wed, Dec 16, 2015 at 10:53 AM, Jakob Odersky  wrote:

> When you re-run the last statement a second time, does it work? Could it
> be related to https://issues.apache.org/jira/browse/SPARK-12350 ?
>
> On 16 December 2015 at 10:39, Ted Yu  wrote:
>
>> Hi,
>> I used the following command on a recently refreshed checkout of master
>> branch:
>>
>> ~/apache-maven-3.3.3/bin/mvn -Phive -Phive-thriftserver -Pyarn
>> -Phadoop-2.4 -Dhadoop.version=2.7.0 package -DskipTests
>>
>> I was then running simple query in spark-shell:
>> Seq(
>>   (83, 0, 38),
>>   (26, 0, 79),
>>   (43, 81, 24)
>> ).toDF("a", "b", "c").registerTempTable("cachedData")
>>
>> sqlContext.cacheTable("cachedData")
>> sqlContext.sql("select * from cachedData").show
>>
>> However, I encountered errors in the following form:
>>
>> http://pastebin.com/QeiwJpwi
>>
>> Under workspace, I found:
>>
>> ./sql/catalyst/target/scala-2.10/classes/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedClass.class
>>
>> but no ByteOrder.class.
>>
>> Did I miss some step(s) ?
>>
>> Thanks
>>
>
>


HiveContext Self join not reading from cache

2015-12-16 Thread Gourav Sengupta
Hi,

This is how the data  can be created:

1. TableA : cached()
2. TableB : cached()
3. TableC: TableA inner join TableB cached()
4. TableC join TableC does not take the data from cache but starts reading
the data for TableA and TableB from disk.

Does this sound like a bug? The self join between TableA and TableB are
working fine and taking data from cache.


Regards,
Gourav


Re: PairRDD(K, L) to multiple files by key serializing each value in L before

2015-12-16 Thread Daniel Valdivia
Hi Abhishek,

Thanks for your suggestion. I did consider it, but I'm not sure whether to
achieve that I'd need to collect() the data first; I don't think it would
fit into the driver memory.

Since I'm trying all of this inside the pyspark shell I'm using a small
dataset; however, the main dataset is about 1.5 GB of data, and my cluster
has only 2 GB of RAM per node (2 nodes).

Do you think that your suggestion could work without having to collect()
the results?

Thanks in advance!

On Wed, Dec 16, 2015 at 4:26 AM, Abhishek Shivkumar <
abhisheksgum...@gmail.com> wrote:

> Hello Daniel,
>
>   I was thinking if you can write
>
> catGroupArr.map(lambda line: create_and_write_file(line))
>
> def create_and_write_file(line):
>
> 1. look at the key of line: line[0]
> 2. Open a file with required file name based on key
> 3. iterate through the values of this key,value pair
>
>for ele in line[1]:
>
> 4. Write every ele into the file created.
> 5. Close the file.
>
> Do you think this works?
>
> Thanks
> Abhishek S
>
>
> Thank you!
>
> With Regards,
> Abhishek S
>
> On Wed, Dec 16, 2015 at 1:05 AM, Daniel Valdivia 
> wrote:
>
>> Hello everyone,
>>
>> I have a PairRDD with a set of key and list of values, each value in the
>> list is a json which I already loaded beginning of my spark app, how can I
>> iterate over each value of the list in my pair RDD to transform it to a
>> string then save the whole content of the key to a file? one file per key
>>
>> my input files look like cat-0-500.txt:
>>
>> *{cat:'red',value:'asd'}*
>> *{cat:'green',value:'zxc'}*
>> *{cat:'red',value:'jkl'}*
>>
>> The PairRDD looks like
>>
>> *('red', [{cat:'red',value:'asd'},{cat:'red',value:'jkl'}])*
>> *('green', [{cat:'green',value:'zxc'}])*
>>
>> so as you can see I I'd like to serialize each json in the value list
>> back to string so I can easily saveAsTextFile(), ofcourse I'm trying to
>> save a separate file for each key
>>
>> The way I got here:
>>
>> *rawcatRdd = sc.textFile("hdfs://x.x.x.../unstructured/cat-0-500.txt")*
>> *import json*
>> *categoriesJson = rawcatRdd.map(lambda x: json.loads(x))*
>> *categories = categoriesJson*
>>
>> *catByDate = categories.map(lambda x: (x['cat'], x)*
>> *catGroup = catByDate.groupByKey()*
>> *catGroupArr = catGroup.mapValues(lambda x : list(x))*
>>
>> Ideally I want to create a cat-red.txt that looks like:
>>
>> {cat:'red',value:'asd'}
>> {cat:'red',value:'jkl'}
>>
>> and the same for the rest of the keys.
>>
>> I already looked at this answer
>> 
>>  but
>> I'm slightly lost as host to process each value in the list (turn into
>> string) before I save the contents to a file, also I cannot figure out how
>> to import *MultipleTextOutputFormat* in python either.
>>
>> I'm trying all this wacky stuff in the pyspark shell
>>
>> Any advice would be greatly appreciated
>>
>> Thanks in advance!
>>
>
>


Re: NPE in using AvroKeyValueInputFormat for newAPIHadoopFile

2015-12-16 Thread Igor Berman
Check version compatibility; I think the Avro lib should be 1.7.4.
Also check that no other lib brings in a transitive dependency on a different
Avro version.
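e.g., assuming a Maven build, something like

mvn dependency:tree -Dincludes=org.apache.avro

and at runtime you can check which jar the Schema class was actually loaded
from (Scala):

classOf[org.apache.avro.Schema].getProtectionDomain.getCodeSource.getLocation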


On 16 December 2015 at 09:44, Jinyuan Zhou  wrote:

> Hi, I tried to load avro files in hdfs but keep getting NPE.
>  I am using AvroKeyValueInputFormat inside newAPIHadoopFile method. Anyone
> have any clue? Here is stack trace
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent
> failure: Lost task 4.3 in stage 0.0 (TID 11, xyz.abc.com):
> java.lang.NullPointerException
>
> at org.apache.avro.Schema.getAliases(Schema.java:1415)
>
> at org.apache.avro.Schema.getAliases(Schema.java:1429)
>
> at org.apache.avro.Schema.applyAliases(Schema.java:1340)
>
> at
> org.apache.avro.generic.GenericDatumReader.getResolver(GenericDatumReader.java:125)
>
> at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:140)
>
> at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
>
> at
> org.apache.avro.mapreduce.AvroRecordReaderBase.nextKeyValue(AvroRecordReaderBase.java:118)
>
> at
> org.apache.avro.mapreduce.AvroKeyValueRecordReader.nextKeyValue(AvroKeyValueRecordReader.java:62)
>
> at
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
>
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1626)
>
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
>
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
>
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
>
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
>
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:744)
>
>
> Thanks,
>
> Jack
> Jinyuan (Jack) Zhou
>


Re: PairRDD(K, L) to multiple files by key serializing each value in L before

2015-12-16 Thread abhisheksgumadi
Hi Daniel

   Yes it will work without the collect method. You just do a map operation on 
every item of the RDD. 

Thanks
Abhishek S

> On 16 Dec 2015, at 18:10, Daniel Valdivia  wrote:
> 
> Hi Abhishek,
> 
> Thanks for your suggestion, I did considered it, but I'm not sure if to 
> achieve that I'd ned to collect() the data first, I don't think it would fit 
> into the Driver memory.
> 
> Since I'm trying all of this inside the pyspark shell I'm using a small 
> dataset, however the main dataset is about 1.5gb of data, and my cluster has 
> only 2gb of ram nodes (2 of them).
> 
> Do you think that your suggestion could work without having to collect() the 
> results?
> 
> Thanks in advance!
> 
>> On Wed, Dec 16, 2015 at 4:26 AM, Abhishek Shivkumar 
>>  wrote:
>> Hello Daniel,
>> 
>>   I was thinking if you can write 
>> 
>> catGroupArr.map(lambda line: create_and_write_file(line))
>> 
>> def create_and_write_file(line):
>> 
>> 1. look at the key of line: line[0]
>> 2. Open a file with required file name based on key
>> 3. iterate through the values of this key,value pair
>> 
>>for ele in line[1]:
>> 
>> 4. Write every ele into the file created.
>> 5. Close the file.
>> 
>> Do you think this works?
>> 
>> Thanks
>> Abhishek S
>> 
>> 
>> Thank you!
>> 
>> With Regards,
>> Abhishek S
>> 
>>> On Wed, Dec 16, 2015 at 1:05 AM, Daniel Valdivia  
>>> wrote:
>>> Hello everyone,
>>> 
>>> I have a PairRDD with a set of key and list of values, each value in the 
>>> list is a json which I already loaded beginning of my spark app, how can I 
>>> iterate over each value of the list in my pair RDD to transform it to a 
>>> string then save the whole content of the key to a file? one file per key
>>> 
>>> my input files look like cat-0-500.txt:
>>> 
>>> {cat:'red',value:'asd'}
>>> {cat:'green',value:'zxc'}
>>> {cat:'red',value:'jkl'}
>>> 
>>> The PairRDD looks like
>>> 
>>> ('red', [{cat:'red',value:'asd'},{cat:'red',value:'jkl'}])
>>> ('green', [{cat:'green',value:'zxc'}])
>>> 
>>> so as you can see I I'd like to serialize each json in the value list back 
>>> to string so I can easily saveAsTextFile(), ofcourse I'm trying to save a 
>>> separate file for each key
>>> 
>>> The way I got here:
>>> 
>>> rawcatRdd = sc.textFile("hdfs://x.x.x.../unstructured/cat-0-500.txt")
>>> import json
>>> categoriesJson = rawcatRdd.map(lambda x: json.loads(x))
>>> categories = categoriesJson
>>> 
>>> catByDate = categories.map(lambda x: (x['cat'], x)
>>> catGroup = catByDate.groupByKey()
>>> catGroupArr = catGroup.mapValues(lambda x : list(x))
>>> 
>>> Ideally I want to create a cat-red.txt that looks like:
>>> 
>>> {cat:'red',value:'asd'}
>>> {cat:'red',value:'jkl'}
>>> 
>>> and the same for the rest of the keys.
>>> 
>>> I already looked at this answer but I'm slightly lost as host to process 
>>> each value in the list (turn into string) before I save the contents to a 
>>> file, also I cannot figure out how to import MultipleTextOutputFormat in 
>>> python either.
>>> 
>>> I'm trying all this wacky stuff in the pyspark shell
>>> 
>>> Any advice would be greatly appreciated
>>> 
>>> Thanks in advance!
> 


Re: Preventing an RDD from shuffling

2015-12-16 Thread Igor Berman
IMHO, you should implement your own RDD with Mongo sharding awareness; that
RDD would then carry a Mongo-aware partitioner, and the incoming data would
be partitioned by that partitioner in the join.
I'm not sure it's a simple task... but you do have to get a partitioner onto
your Mongo RDD.
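A very rough sketch of the idea (untested; the names are made up and the
shard lookup is just a placeholder):

import org.apache.spark.Partitioner

class MongoShardPartitioner(numShards: Int) extends Partitioner {
  override def numPartitions: Int = numShards
  override def getPartition(key: Any): Int =
    (key.hashCode % numShards + numShards) % numShards  // replace with the real shard-range lookup
}

// a custom Mongo RDD would then override `partitioner` with
// Some(new MongoShardPartitioner(n)); a join only shuffles the side whose
// partitioner does not already match.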

On 16 December 2015 at 12:23, sparkuser2345  wrote:

> Is there a way to prevent an RDD from shuffling in a join operation without
> repartitioning it?
>
> I'm reading an RDD from sharded MongoDB, joining that with an RDD of
> incoming data (+ some additional calculations), and writing the resulting
> RDD back to MongoDB. It would make sense to shuffle only the incoming data
> RDD so that the joined RDD would already be partitioned correctly according
> to the MondoDB shard key.
>
> I know I can prevent an RDD from shuffling in a join operation by
> partitioning it beforehand but partitioning would already shuffle the RDD.
> In addition, I'm only doing the join once per RDD read from MongoDB. Is
> there a way to tell Spark to shuffle only the incoming data RDD?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Preventing-an-RDD-from-shuffling-tp25717.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: NPE in using AvroKeyValueInputFormat for newAPIHadoopFile

2015-12-16 Thread Ted Yu
From Spark's root pom.xml :

    <avro.version>1.7.7</avro.version>

FYI

On Wed, Dec 16, 2015 at 3:06 PM, Igor Berman  wrote:

> check version compatibility
> I think avro lib should be 1.7.4
> check that no other lib brings transitive dependency of other avro version
>
>
> On 16 December 2015 at 09:44, Jinyuan Zhou  wrote:
>
>> Hi, I tried to load avro files in hdfs but keep getting NPE.
>>  I am using AvroKeyValueInputFormat inside newAPIHadoopFile method.
>> Anyone have any clue? Here is stack trace
>>
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
>> due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent
>> failure: Lost task 4.3 in stage 0.0 (TID 11, xyz.abc.com):
>> java.lang.NullPointerException
>>
>> at org.apache.avro.Schema.getAliases(Schema.java:1415)
>>
>> at org.apache.avro.Schema.getAliases(Schema.java:1429)
>>
>> at org.apache.avro.Schema.applyAliases(Schema.java:1340)
>>
>> at
>> org.apache.avro.generic.GenericDatumReader.getResolver(GenericDatumReader.java:125)
>>
>> at
>> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:140)
>>
>> at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
>>
>> at
>> org.apache.avro.mapreduce.AvroRecordReaderBase.nextKeyValue(AvroRecordReaderBase.java:118)
>>
>> at
>> org.apache.avro.mapreduce.AvroKeyValueRecordReader.nextKeyValue(AvroKeyValueRecordReader.java:62)
>>
>> at
>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
>>
>> at
>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>>
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>
>> at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1626)
>>
>> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
>>
>> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
>>
>> at
>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
>>
>> at
>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
>>
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>>
>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>>
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>
>> at java.lang.Thread.run(Thread.java:744)
>>
>>
>> Thanks,
>>
>> Jack
>> Jinyuan (Jack) Zhou
>>
>
>


Re: looking for a easier way to count the number of items in a JavaDStream

2015-12-16 Thread Todd Nist
Another possible alternative is to register a StreamingListener and then
reference the BatchInfo.numRecords; good example here,
https://gist.github.com/akhld/b10dc491aad1a2007183.

After registering the listener, simply implement the appropriate "onEvent"
method, where onEvent is onBatchStarted, onBatchCompleted, ..., for example:

public void onBatchCompleted(StreamingListenerBatchCompleted batchCompleted) {
  System.out.println("Batch completed, Total records :" +
      batchCompleted.batchInfo().numRecords().get().toString());
}

That should be very efficient and avoid any collect() calls, just to obtain
the count of records on the DStream.
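In Scala the full wiring might look roughly like this (untested sketch; the
exact BatchInfo fields can differ slightly between versions):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class RecordCountListener extends StreamingListener {
  @volatile var totalRecords: Long = 0L
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    totalRecords += batchCompleted.batchInfo.numRecords
  }
}

val listener = new RecordCountListener
ssc.addStreamingListener(listener)   // register before ssc.start()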

HTH.

-Todd

On Wed, Dec 16, 2015 at 3:34 PM, Bryan Cutler  wrote:

> To follow up with your other issue, if you are just trying to count
> elements in a DStream, you can do that without an Accumulator.  foreachRDD
> is meant to be an output action, it does not return anything and it is
> actually run in the driver program.  Because Java (before 8) handles
> closures a little differently, it might be easiest to implement the
> function to pass to foreachRDD as something like this:
>
> class MyFunc implements VoidFunction {
>
>   public long total = 0;
>
>   @Override
>   public void call(JavaRDD rdd) {
> System.out.println("foo " + rdd.collect().toString());
> total += rdd.count();
>   }
> }
>
> MyFunc f = new MyFunc();
>
> inputStream.foreachRDD(f);
>
> // f.total will have the count of all RDDs
>
> Hope that helps some!
>
> -bryan
>
> On Wed, Dec 16, 2015 at 8:37 AM, Bryan Cutler  wrote:
>
>> Hi Andy,
>>
>> Regarding the foreachrdd return value, this Jira that will be in 1.6
>> should take care of that https://issues.apache.org/jira/browse/SPARK-4557
>> and make things a little simpler.
>> On Dec 15, 2015 6:55 PM, "Andy Davidson" 
>> wrote:
>>
>>> I am writing  a JUnit test for some simple streaming code. I want to
>>> make assertions about how many things are in a given JavaDStream. I wonder
>>> if there is an easier way in Java to get the count?
>>>
>>> I think there are two points of friction.
>>>
>>>
>>>1. is it easy to create an accumulator of type double or int, How
>>>ever Long is not supported
>>>2. We need to use javaDStream.foreachRDD. The Function interface
>>>must return void. I was not able to define an accumulator in my driver
>>>and use a lambda function. (I am new to lambda in Java)
>>>
>>> Here is a little lambda example that logs my test objects. I was not
>>> able to figure out how to get  to return a value or access a accumulator
>>>
>>>data.foreachRDD(rdd -> {
>>>
>>> logger.info(“Begin data.foreachRDD" );
>>>
>>> for (MyPojo pojo : rdd.collect()) {
>>>
>>> logger.info("\n{}", pojo.toString());
>>>
>>> }
>>>
>>> return null;
>>>
>>> });
>>>
>>>
>>> Any suggestions would be greatly appreciated
>>>
>>> Andy
>>>
>>> This following code works in my driver but is a lot of code for such a
>>> trivial computation. Because it needs to the JavaSparkContext I do not
>>> think it would work inside a closure. I assume the works do not have access
>>> to the context as a global and that it shipping it in the closure is not a
>>> good idea?
>>>
>>> public class JavaDStreamCount implements Serializable {
>>>
>>> private static final long serialVersionUID = -3600586183332429887L;
>>>
>>> public static Logger logger =
>>> LoggerFactory.getLogger(JavaDStreamCount.class);
>>>
>>>
>>>
>>> public Double hack(JavaSparkContext sc, JavaDStream javaDStream)
>>> {
>>>
>>> Count c = new Count(sc);
>>>
>>> javaDStream.foreachRDD(c);
>>>
>>> return c.getTotal().value();
>>>
>>> }
>>>
>>>
>>>
>>> class Count implements Function {
>>>
>>> private static final long serialVersionUID =
>>> -5239727633710162488L;
>>>
>>> Accumulator total;
>>>
>>>
>>>
>>> public Count(JavaSparkContext sc) {
>>>
>>> total = sc.accumulator(0.0);
>>>
>>> }
>>>
>>>
>>>
>>> @Override
>>>
>>> public java.lang.Void call(JavaRDD rdd) throws Exception {
>>>
>>> List data = rdd.collect();
>>>
>>> int dataSize = data.size();
>>>
>>> logger.error("data.size:{}", dataSize);
>>>
>>> long num = rdd.count();
>>>
>>> logger.error("num:{}", num);
>>>
>>> total.add(new Double(num));
>>>
>>> return null;
>>>
>>> }
>>>
>>>
>>> public Accumulator getTotal() {
>>>
>>> return total;
>>>
>>> }
>>>
>>> }
>>>
>>> }
>>>
>>>
>>>
>>>
>>>
>


Re: Scala VS Java VS Python

2015-12-16 Thread Gary Struthers
The code examples in “Learning Spark” are in Scala, Java, and Python, and it
is an excellent book both for learning Spark and for showing just enough Scala
to program Spark.
http://shop.oreilly.com/product/0636920028512.do 


Then compare the API for a Spark RDD and a Scala collection. Spark is Scala’s
killer app, as Martin Odersky explains in this talk:
https://www.youtube.com/watch?v=NW5h8d_ZyOs


Gary Struthers

Re: Scala VS Java VS Python

2015-12-16 Thread Stephen Boesch
There are solid reasons for Spark having been built on the JVM rather than
Python. The question for Daniel appears at this point to be Scala vs Java 8.
For that there are many comparisons already available, but in the case of
working with Spark there is the additional benefit on the Scala side that the
core libraries are in that language.

2015-12-16 13:41 GMT-08:00 Darren Govoni :

> I use python too. I'm actually surprises it's not the primary language
> since it is by far more used in data science than java snd Scala combined.
>
> If I had a second choice of script language for general apps I'd want
> groovy over scala.
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
>  Original message 
> From: Daniel Lopes 
> Date: 12/16/2015 4:16 PM (GMT-05:00)
> To: Daniel Valdivia 
> Cc: user 
> Subject: Re: Scala VS Java VS Python
>
> For me Scala is better like Spark is written in Scala, and I like python
> cuz I always used python for data science. :)
>
> On Wed, Dec 16, 2015 at 5:54 PM, Daniel Valdivia 
> wrote:
>
>> Hello,
>>
>> This is more of a "survey" question for the community, you can reply to
>> me directly so we don't flood the mailing list.
>>
>> I'm having a hard time learning Spark using Python since the API seems to
>> be slightly incomplete, so I'm looking at my options to start doing all my
>> apps in either Scala or Java, being a Java Developer, java 1.8 looks like
>> the logical way, however I'd like to ask here what's the most common (Scala
>> Or Java) since I'm observing mixed results in the social documentation,
>> however Scala seems to be the predominant language for spark examples.
>>
>> Thank for the advice
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> *Daniel Lopes, B.Eng*
> Data Scientist - BankFacil
> CREA/SP 5069410560
> 
> Mob +55 (18) 99764-2733 
> Ph +55 (11) 3522-8009
> http://about.me/dannyeuu
>
> Av. Nova Independência, 956, São Paulo, SP
> Bairro Brooklin Paulista
> CEP 04570-001
> https://www.bankfacil.com.br
>
>


Re: Kafka - streaming from multiple topics

2015-12-16 Thread jpocalan
Nevermind, I found the answer to my questions.
The following spark configuration property will allow you to process
multiple KafkaDirectStream in parallel:
--conf spark.streaming.concurrentJobs=





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-streaming-from-multiple-topics-tp8678p25723.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: File not found error running query in spark-shell

2015-12-16 Thread Jakob Odersky
For future reference, this should be fixed with PR #10337 (
https://github.com/apache/spark/pull/10337)

On 16 December 2015 at 11:01, Jakob Odersky  wrote:

> Yeah, the same kind of error actually happens in the JIRA. It actually
> succeeds but a load of exceptions are thrown. Subsequent runs don't produce
> any errors anymore.
>
> On 16 December 2015 at 10:55, Ted Yu  wrote:
>
>> The first run actually worked. It was the amount of exceptions preceding
>> the result that surprised me.
>>
>> I want to see if there is a way of getting rid of the exceptions.
>>
>> Thanks
>>
>> On Wed, Dec 16, 2015 at 10:53 AM, Jakob Odersky 
>> wrote:
>>
>>> When you re-run the last statement a second time, does it work? Could it
>>> be related to https://issues.apache.org/jira/browse/SPARK-12350 ?
>>>
>>> On 16 December 2015 at 10:39, Ted Yu  wrote:
>>>
 Hi,
 I used the following command on a recently refreshed checkout of master
 branch:

 ~/apache-maven-3.3.3/bin/mvn -Phive -Phive-thriftserver -Pyarn
 -Phadoop-2.4 -Dhadoop.version=2.7.0 package -DskipTests

 I was then running simple query in spark-shell:
 Seq(
   (83, 0, 38),
   (26, 0, 79),
   (43, 81, 24)
 ).toDF("a", "b", "c").registerTempTable("cachedData")

 sqlContext.cacheTable("cachedData")
 sqlContext.sql("select * from cachedData").show

 However, I encountered errors in the following form:

 http://pastebin.com/QeiwJpwi

 Under workspace, I found:

 ./sql/catalyst/target/scala-2.10/classes/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedClass.class

 but no ByteOrder.class.

 Did I miss some step(s) ?

 Thanks

>>>
>>>
>>
>


Re: Spark streaming: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration ... on restart from checkpoint

2015-12-16 Thread Bartłomiej Alberski
First of all, thanks @tdas for looking into my problem.

Yes, I checked it separately and it is working fine. For the piece of code
below there is not a single exception and the values are sent correctly.

val reporter = new MyClassReporter(...)
reporter.send(...)
val out = new FileOutputStream("out123.txt")
val outO = new ObjectOutputStream(out)
outO.writeObject(reporter)
outO.flush()
outO.close()

val in = new FileInputStream("out123.txt")
val inO = new ObjectInputStream(in)
val reporterFromFile  =
inO.readObject().asInstanceOf[StreamingGraphiteReporter]
reporterFromFile.send(...)
in.close()

Maybe I am wrong, but I think it would be strange if a class that implements
Serializable and is properly broadcast to the executors could not be
serialized and deserialized?
I also prepared a slightly different piece of code and received a slightly
different exception. Right now it looks like:
java.lang.ClassCastException: [B cannot be cast to
com.example.sender.MyClassReporter.

Maybe I am wrong, but it looks like, when restarting from the checkpoint, the
bytes read back for MyClassReporter are not being deserialized into the right
class.
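One thing I still want to try, since the streaming programming guide mentions
that broadcast variables cannot be recovered from a checkpoint, is the lazily
instantiated singleton pattern that re-creates the broadcast after a restart.
A rough sketch (names made up to match my example):

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object ReporterSingleton {
  @volatile private var instance: Broadcast[MyClassReporter] = null

  def getInstance(sc: SparkContext, config: MyClassConfig, flow: String): Broadcast[MyClassReporter] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.broadcast(new MyClassReporter(config, flow))
        }
      }
    }
    instance
  }
}

// inside the output operation, fetch it through the RDD's context so it is
// re-created after a driver restart, e.g.:
// dstream.foreachRDD { rdd =>
//   val reporter = ReporterSingleton.getInstance(rdd.sparkContext, config, flow)
//   rdd.foreachPartition(_.foreach(r => reporter.value.send(...)))
// }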

2015-12-16 2:38 GMT+01:00 Tathagata Das :

> Could you test serializing and deserializing the MyClassReporter  class
> separately?
>
> On Mon, Dec 14, 2015 at 8:57 AM, Bartłomiej Alberski 
> wrote:
>
>> Below is the full stacktrace(real names of my classes were changed) with
>> short description of entries from my code:
>>
>> rdd.mapPartitions{ partition => //this is the line to which second
>> stacktrace entry is pointing
>>   val sender =  broadcastedValue.value // this is the maing place to
>> which first stacktrace entry is pointing
>> }
>>
>> java.lang.ClassCastException:
>> org.apache.spark.util.SerializableConfiguration cannot be cast to
>> com.example.sender.MyClassReporter
>> at com.example.flow.Calculator
>> $$anonfun$saveAndLoadHll$1$$anonfun$apply$14.apply(AnomalyDetection.scala:87)
>> at com.example.flow.Calculator
>> $$anonfun$saveAndLoadHll$1$$anonfun$apply$14.apply(Calculator.scala:82)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> 2015-12-14 17:10 GMT+01:00 Ted Yu :
>>
>>> Can you show the complete stack trace for the ClassCastException ?
>>>
>>> Please see the following thread:
>>> http://search-hadoop.com/m/q3RTtgEUHVmJA1T1
>>>
>>> Cheers
>>>
>>> On Mon, Dec 14, 2015 at 7:33 AM, alberskib  wrote:
>>>
 Hey all,

 When my streaming application is restarting from failure (from
 checkpoint) I
 am receiving strange error:

 java.lang.ClassCastException:
 org.apache.spark.util.SerializableConfiguration cannot be cast to
 com.example.sender.MyClassReporter.

 Instance of B class is created on driver side (with proper config
 passed as
 constructor arg) and broadcasted to the executors in order to ensure
 that on
 each worker there will be only single instance. Everything is going
 well up
 to place where I am getting value of broadcasted field and executing
 function on it i.e.
 broadcastedValue.value.send(...)

 Below you can find definition of MyClassReporter (with trait):

 trait Reporter{
   def send(name: String, value: String, timestamp: Long) : Unit
   def flush() : Unit
 }

 class MyClassReporter(config: MyClassConfig, flow: String) extends
 Reporter
 with Serializable {

   val prefix = s"${config.senderConfig.prefix}.$flow"

   var counter = 0

   @transient
   private lazy val sender : GraphiteSender = initialize()

   @transient
   private lazy val threadPool =

 ExecutionContext.fromExecutorService(Executors.newSingleThreadExecutor())

   private def initialize() = {
   val sender = new Sender(
 new InetSocketAddress(config.senderConfig.hostname,
 config.senderConfig.port)
   )

Re: Scala VS Java VS Python

2015-12-16 Thread Darren Govoni


I use Python too. I'm actually surprised it's not the primary language, since
it is by far more used in data science than Java and Scala combined.
If I had a second choice of scripting language for general apps I'd want
Groovy over Scala.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Daniel Lopes  
Date: 12/16/2015  4:16 PM  (GMT-05:00) 
To: Daniel Valdivia  
Cc: user  
Subject: Re: Scala VS Java VS Python 

For me Scala is better like Spark is written in Scala, and I like python cuz I 
always used python for data science. :)
On Wed, Dec 16, 2015 at 5:54 PM, Daniel Valdivia  
wrote:
Hello,



This is more of a "survey" question for the community, you can reply to me 
directly so we don't flood the mailing list.



I'm having a hard time learning Spark using Python since the API seems to be 
slightly incomplete, so I'm looking at my options to start doing all my apps in 
either Scala or Java, being a Java Developer, java 1.8 looks like the logical 
way, however I'd like to ask here what's the most common (Scala Or Java) since 
I'm observing mixed results in the social documentation, however Scala seems to 
be the predominant language for spark examples.



Thank for the advice

-

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org






-- 
Daniel Lopes, B.Eng
Data Scientist - BankFacil
CREA/SP 5069410560

Mob +55 (18) 99764-2733
Ph +55 (11) 3522-8009
http://about.me/dannyeuu

Av. Nova Independência, 956, São Paulo, SP
Bairro Brooklin Paulista
CEP 04570-001
https://www.bankfacil.com.br




Re: Scala VS Java VS Python

2015-12-16 Thread Lan Jiang
For a Spark data science project, Python might be a good choice. However, for
Spark Streaming the Python API is still lagging. For example, for the Kafka
no-receiver connector, according to the Spark 1.5.2 doc: "Spark 1.4 added a
Python API, but it is not yet at full feature parity”.

Java does not have a REPL shell, which is a major drawback from my perspective.

Lan



> On Dec 16, 2015, at 3:46 PM, Stephen Boesch  wrote:
> 
> There are solid reasons to have built spark on the jvm vs python. The 
> question for Daniel appear to be at this point scala vs java8. For that there 
> are many comparisons already available: but in the case of working with spark 
> there is the additional benefit for the scala side that the core libraries 
> are in that language.  
> 
> 2015-12-16 13:41 GMT-08:00 Darren Govoni  >:
> I use python too. I'm actually surprises it's not the primary language since 
> it is by far more used in data science than java snd Scala combined.
> 
> If I had a second choice of script language for general apps I'd want groovy 
> over scala.
> 
> 
> 
> Sent from my Verizon Wireless 4G LTE smartphone
> 
> 
>  Original message 
> From: Daniel Lopes > 
> Date: 12/16/2015 4:16 PM (GMT-05:00) 
> To: Daniel Valdivia  > 
> Cc: user > 
> Subject: Re: Scala VS Java VS Python 
> 
> For me Scala is better like Spark is written in Scala, and I like python cuz 
> I always used python for data science. :)
> 
> On Wed, Dec 16, 2015 at 5:54 PM, Daniel Valdivia  > wrote:
> Hello,
> 
> This is more of a "survey" question for the community, you can reply to me 
> directly so we don't flood the mailing list.
> 
> I'm having a hard time learning Spark using Python since the API seems to be 
> slightly incomplete, so I'm looking at my options to start doing all my apps 
> in either Scala or Java, being a Java Developer, java 1.8 looks like the 
> logical way, however I'd like to ask here what's the most common (Scala Or 
> Java) since I'm observing mixed results in the social documentation, however 
> Scala seems to be the predominant language for spark examples.
> 
> Thank for the advice
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 
> 
> 
> -- 
> Daniel Lopes, B.Eng
> Data Scientist - BankFacil
> CREA/SP 5069410560 
> 
> Mob +55 (18) 99764-2733 
> Ph +55 (11) 3522-8009
> http://about.me/dannyeuu 
> 
> Av. Nova Independência, 956, São Paulo, SP
> Bairro Brooklin Paulista
> CEP 04570-001
> https://www.bankfacil.com.br 
> 
> 



RE: ideal number of executors per machine

2015-12-16 Thread Bui, Tri
Article below gives a good idea.

http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

Play around with the two configurations (a large number of executors with few cores each, 
and a small number of executors with many cores each). Calculated values have to be conservative 
or they will make the Spark jobs unstable.
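
As a rough illustration of the two shapes on 64G / 32-core machines (a minimal sketch; the
numbers below are only a starting point, not a recommendation, and should be tuned conservatively):

import org.apache.spark.SparkConf

// Shape 1: several smaller executors per node (leaves headroom for the OS)
val manySmall = new SparkConf()
  .set("spark.executor.cores", "5")     // roughly 6 executors per 32-core node
  .set("spark.executor.memory", "8g")

// Shape 2: one or two large executors per node
val fewLarge = new SparkConf()
  .set("spark.executor.cores", "15")
  .set("spark.executor.memory", "24g")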

Thx
tri

From: Veljko Skarich [mailto:veljko.skar...@gmail.com]
Sent: Tuesday, December 15, 2015 3:08 PM
To: user@spark.apache.org
Subject: ideal number of executors per machine

Hi,

I'm looking for suggestions on the ideal number of executors per machine. I run 
my jobs on 64G 32 core machines, and at the moment I have one executor running 
per machine, on the spark standalone cluster.

 I could not find many guidelines for figuring out the ideal number of 
executors; the Spark official documentation merely recommends not having more 
than 64G per executor to avoid GC issues. Anyone have any advice on this?

thank you.


Re: Scala VS Java VS Python

2015-12-16 Thread Veljko Skarich
I use Scala, since I was most familiar with it out of the three languages,
when I started using Spark. I would say that learning Scala with no
functional programming background is somewhat challenging, but well worth
it if you have the time. As others have pointed out, using the REPL and
inspecting the core libraries are great points in Scala's favor here.

On Wed, Dec 16, 2015 at 1:53 PM, Lan Jiang  wrote:

> For Spark data science project, Python might be a good choice. However,
> for Spark streaming, Python API is still lagging. For example, for Kafka no
> receiver connector, according to the Spark 1.5.2 doc:  "Spark 1.4 added a
> Python API, but it is not yet at full feature parity”.
>
> Java does not have REPL shell, which is a major drawback from my
> perspective.
>
> Lan
>
>
>
>
> On Dec 16, 2015, at 3:46 PM, Stephen Boesch  wrote:
>
> There are solid reasons to have built spark on the jvm vs python. The
> question for Daniel appear to be at this point scala vs java8. For that
> there are many comparisons already available: but in the case of working
> with spark there is the additional benefit for the scala side that the core
> libraries are in that language.
>
> 2015-12-16 13:41 GMT-08:00 Darren Govoni :
>
>> I use python too. I'm actually surprises it's not the primary language
>> since it is by far more used in data science than java snd Scala combined.
>>
>> If I had a second choice of script language for general apps I'd want
>> groovy over scala.
>>
>>
>>
>> Sent from my Verizon Wireless 4G LTE smartphone
>>
>>
>>  Original message 
>> From: Daniel Lopes 
>> Date: 12/16/2015 4:16 PM (GMT-05:00)
>> To: Daniel Valdivia 
>> Cc: user 
>> Subject: Re: Scala VS Java VS Python
>>
>> For me Scala is better like Spark is written in Scala, and I like python
>> cuz I always used python for data science. :)
>>
>> On Wed, Dec 16, 2015 at 5:54 PM, Daniel Valdivia > > wrote:
>>
>>> Hello,
>>>
>>> This is more of a "survey" question for the community, you can reply to
>>> me directly so we don't flood the mailing list.
>>>
>>> I'm having a hard time learning Spark using Python since the API seems
>>> to be slightly incomplete, so I'm looking at my options to start doing all
>>> my apps in either Scala or Java, being a Java Developer, java 1.8 looks
>>> like the logical way, however I'd like to ask here what's the most common
>>> (Scala Or Java) since I'm observing mixed results in the social
>>> documentation, however Scala seems to be the predominant language for spark
>>> examples.
>>>
>>> Thank for the advice
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> *Daniel Lopes, B.Eng*
>> Data Scientist - BankFacil
>> CREA/SP 5069410560
>> 
>> Mob +55 (18) 99764-2733 
>> Ph +55 (11) 3522-8009
>> http://about.me/dannyeuu
>>
>> Av. Nova Independência, 956, São Paulo, SP
>> Bairro Brooklin Paulista
>> CEP 04570-001
>> https://www.bankfacil.com.br
>>
>>
>
>


Re: PySpark Connection reset by peer: socket write error

2015-12-16 Thread Surendran Duraisamy
I came across this issue while running the program on my laptop with a small
data set (around 3.5 MB).

The code is straightforward, as follows.

data = sc.textFile("inputfile.txt")

mappedRdd = data.map(*mapFunction*).cache()

model = ALS.train(mappedRdd , 10, 15)

...

mapFunction is a simple map function where I split the line and
create a Rating().


Regards,

Surendran


On Thu, Dec 17, 2015 at 9:35 AM, Vijay Gharge 
wrote:

> Can you elaborate your problem further ? Looking at the error looks like
> you are running on cluster. Also share relevant code for better
> understanding.
>
>
> On Wednesday 16 December 2015, Surendran Duraisamy 
> wrote:
>
>> Hi,
>>
>>
>>
>> I am running ALS to train a data set of around 15 lines in my local
>> machine. When I call train I am getting following exception.
>>
>>
>>
>> *print *mappedRDDs.count() # this prints correct RDD count
>> model = ALS.train(mappedRDDs, 10, 15)
>>
>>
>>
>> 15/12/16 18:43:18 ERROR PythonRDD: Python worker exited unexpectedly
>> (crashed)
>>
>> java.net.SocketException: Connection reset by peer: socket write error
>>
>> at java.net.SocketOutputStream.socketWrite0(Native Method)
>>
>> at
>> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
>>
>> at
>> java.net.SocketOutputStream.write(SocketOutputStream.java:159)
>>
>> at
>> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>>
>> at
>> java.io.BufferedOutputStream.write(BufferedOutputStream.java:121)
>>
>> at
>> java.io.DataOutputStream.write(DataOutputStream.java:107)
>>
>> at
>> java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>>
>> at
>> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:413)
>>
>> at
>> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
>>
>> at
>> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
>>
>> at
>> scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>
>> at
>> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>>
>> at
>> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:425)
>>
>> at
>> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:248)
>>
>> at
>> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
>>
>> at
>> org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
>>
>>
>>
>> I am getting this error only when I use MLlib with pyspark. how to
>> resolve this issue?
>>
>>
>>
>> Regards,
>>
>> Surendran
>>
>>
>>
>
>
> --
> Regards,
> Vijay Gharge
>
>
>
>


MLlib: Feature Importances API

2015-12-16 Thread Asim Jalis
I wanted to get the feature importances for a Random Forest as
described in this JIRA: https://issues.apache.org/jira/browse/SPARK-5133

However, I don't see how to call this. I don't see any methods exposed on

org.apache.spark.mllib.tree.RandomForest

How can I get featureImportances when I generate a RandomForest model in
this code?

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import util.Random

def displayModel(model:RandomForestModel) = {
  // Display model.
  println("Learned classification tree model:\n" + model.toDebugString)
}

def saveModel(model:RandomForestModel,path:String) = {
  // Save and load model.
  model.save(sc, path)
  val sameModel = RandomForestModel.load(sc, path)
}

def testModel(model:RandomForestModel,testData:RDD[LabeledPoint]) = {
  // Test model.
  val labelAndPreds = testData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
  }
  val testErr = labelAndPreds.
filter(r => r._1 != r._2).count.toDouble / testData.count()
  println("Test Error = " + testErr)
}

def buildModel(trainingData:RDD[LabeledPoint],
  numClasses:Int,categoricalFeaturesInfo:Map[Int,Int]) = {
  val numTrees = 30
  val featureSubsetStrategy = "auto"
  val impurity = "gini"
  val maxDepth = 4
  val maxBins = 32

  // Build model.
  val model = RandomForest.trainClassifier(
trainingData, numClasses, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity, maxDepth,
maxBins)

  model
}

// Create plain RDD.
val rdd = sc.parallelize(Range(0,1000))

// Convert to LabeledPoint RDD.
val data = rdd.
  map(x => {
val label = x % 2
val feature1 = x % 5
val feature2 = x % 7
val features = Seq(feature1,feature2).
  map(_.toDouble).
  zipWithIndex.
  map(_.swap)
val vector = Vectors.sparse(features.size, features)
val point = new LabeledPoint(label, vector)
point })

// Split data into training (70%) and test (30%).
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Set up parameters for training.
val numClasses = data.map(_.label).distinct.count.toInt
val categoricalFeaturesInfo = Map[Int, Int]()

val model = buildModel(
trainingData,
numClasses,
categoricalFeaturesInfo)
testModel(model,testData)


spark master process shutdown for timeout

2015-12-16 Thread yaoxiaohua
Hi guys,

I have two nodes used as Spark masters, spark1 and spark2.

Spark 1.4.0

JDK 1.7 (Sun JDK)

 

These days I have found that the spark2 master process may shut down; I found
this in the log file:

15/12/17 13:09:58 INFO ClientCnxn: Client session timed out, have not heard
from server in 40020ms for sessionid 0x351a889694144b1, closing socket
connection and attempting reconnect

15/12/17 13:09:58 INFO ConnectionStateManager: State change: SUSPENDED

15/12/17 13:09:58 INFO ZooKeeperLeaderElectionAgent: We have lost leadership

15/12/17 13:09:58 ERROR Master: Leadership has been revoked -- master
shutting down.

15/12/17 13:09:58 INFO Utils: Shutdown hook called

 

It looks like a timeout. I don't know which configuration to change to prevent
this from happening; please help me.

Thanks.

 

Best Regards,

Evan Yao



Error getting response from spark driver rest APIs : java.lang.IncompatibleClassChangeError: Implementing class

2015-12-16 Thread ihavethepotential
Hi all,

I am trying to get the job details for my spark application using a REST
call to the driver API. I am making a GET request to the following URI 

/api/v1/applications/socketEs2/jobs

But getting the following exception:

2015-12-16 19:46:28 qtp1912556493-56 [WARN ] ServletHandler -
/api/v1/applications/socketEs2/jobs
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at
com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:728)
at
com.sun.jersey.spi.container.servlet.WebComponent.createResourceConfig(WebComponent.java:678)
at
com.sun.jersey.spi.container.servlet.WebComponent.init(WebComponent.java:203)
at
com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:373)
at
com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:556)
at javax.servlet.GenericServlet.init(GenericServlet.java:244)
at
org.spark-project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:532)
at
org.spark-project.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:415)
at
org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:657)
at
org.spark-project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
at
org.spark-project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
at
org.spark-project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
at
org.spark-project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
at
org.spark-project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.spark-project.jetty.server.handler.GzipHandler.handle(GzipHandler.java:301)
at
org.spark-project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.spark-project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.spark-project.jetty.server.Server.handle(Server.java:370)
at
org.spark-project.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at
org.spark-project.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
at
org.spark-project.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
at
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:644)
at
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at
org.spark-project.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at
org.spark-project.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
at
org.spark-project.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
at
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IncompatibleClassChangeError: Implementing class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at
com.sun.jersey.api.core.ScanningResourceConfig.init(ScanningResourceConfig.java:79)
at
com.sun.jersey.api.core.PackagesResourceConfig.init(PackagesResourceConfig.java:104)
at
com.sun.jersey.api.core.PackagesResourceConfig.<init>(PackagesResourceConfig.java:78)
at
com.sun.jersey.api.core.PackagesResourceConfig.<init>(PackagesResourceConfig.java:89)
... 33 more

Please help.

Thanks in advance.

Regards,
Rakesh




Access row column by field name

2015-12-16 Thread Daniel Valdivia
Hi,

I'm processing the json I have in a text file using DataFrames, however right 
now I'm trying to figure out a way to access a certain value within the rows of 
my data frame if I only know the field name and not the respective field 
position in the schema.

I noticed that row.schema and row.dtypes give me information about the 
auto-generated schema, but I cannot see a straightforward path for this; I'm 
trying to create a PairRDD out of this.

Is there any easy way to figure out the field position by its field name (the 
key it had in the JSON)?

so this

val sqlContext = new SQLContext(sc)
val rawIncRdd = 
sc.textFile("hdfs://1.2.3.4:8020/user/hadoop/incidents/unstructured/inc-0-500.txt")
 val df = sqlContext.jsonRDD(rawIncRdd)
df.foreach(line => println(line.getString(0)))


would turn into something like this

val sqlContext = new SQLContext(sc)
val rawIncRdd = 
sc.textFile("hdfs://1.2.3.4:8020/user/hadoop/incidents/unstructured/inc-0-500.txt")
 val df = sqlContext.jsonRDD(rawIncRdd)
df.foreach(line => println(line.getString("field_name")))

thanks for the advice

Re: Preventing an RDD from shuffling

2015-12-16 Thread Koert Kuipers
a join needs a partitioner, and will shuffle the data as needed for the
given partitioner (or if the data is already partitioned then it will leave
it alone), after which it will process with something like a map-side join.

if you can specify a partitioner that meets the exact layout of your data
in mongo then the shuffle should be harmless for your mongo data (but will
still be done, i don't think there is a way yet to tell spark to trust you
that its already shuffled), and only the other dataset would get shuffled
to meet your mongo layout.
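
a rough sketch of the idea (assuming mongoRdd and incomingRdd are key/value RDDs
keyed by the shard key; the partitioner here is only illustrative):

import org.apache.spark.HashPartitioner

// Partitioner intended to mirror how the data is laid out in Mongo
val shardLayout = new HashPartitioner(mongoRdd.partitions.length)

// Only incomingRdd really needs to move to match that layout; the shuffle of
// mongoRdd is still performed, but should be close to a no-op if shardLayout
// actually matches the Mongo layout.
val joined = mongoRdd.join(incomingRdd.partitionBy(shardLayout), shardLayout)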

On Wed, Dec 16, 2015 at 5:23 AM, sparkuser2345 
wrote:

> Is there a way to prevent an RDD from shuffling in a join operation without
> repartitioning it?
>
> I'm reading an RDD from sharded MongoDB, joining that with an RDD of
> incoming data (+ some additional calculations), and writing the resulting
> RDD back to MongoDB. It would make sense to shuffle only the incoming data
> RDD so that the joined RDD would already be partitioned correctly according
> to the MondoDB shard key.
>
> I know I can prevent an RDD from shuffling in a join operation by
> partitioning it beforehand but partitioning would already shuffle the RDD.
> In addition, I'm only doing the join once per RDD read from MongoDB. Is
> there a way to tell Spark to shuffle only the incoming data RDD?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Preventing-an-RDD-from-shuffling-tp25717.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: SparkContext.cancelJob - what part of Spark uses it? Nothing in webUI to kill jobs?

2015-12-16 Thread Jacek Laskowski
Thanks Mark for the answer! It helps, but still leaves me with few
more questions. If you don't mind, I'd like to ask you few more
questions.

When you said "It can be used, and is used in user code, but it isn't
always as straightforward as you might think." did you think about the
Spark code or some other user code? Can I have a look at the code and
the use case? The method is `private[spark]` and it's not even
@DeveloperApi that makes using the method even more risky. I believe
it's a very low-level ingredient of Spark that very few people use if
at all. If I could see the code that uses the method, that could help.

Following up, isn't killing a stage similar to killing a job? They can
both be shared, and I could imagine a very similar case for killing a
job as for a stage, where an implementation does some checks before
eventually killing the job. That is possible for stages, which are in a
sense similar to jobs, so... I'm still unsure why the method is not used
by Spark itself. If it's not used by Spark, why would it be useful for
others outside Spark?

Doh, why did I come across the method? It will take some time before I
forget about it :-)
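
For reference, the job-group route Mark describes in the reply quoted below looks
roughly like this (the group id, description and rdd are illustrative):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

Future {
  // tag every job submitted from this thread with the group id
  sc.setJobGroup("my-group", "jobs I may want to cancel", interruptOnCancel = true)
  rdd.count()
}

// ... later, from another thread:
sc.cancelJobGroup("my-group")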

Pozdrawiam,
Jacek

--
Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
http://blog.jaceklaskowski.pl
Mastering Spark https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski
Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski


On Wed, Dec 16, 2015 at 10:55 AM, Mark Hamstra  wrote:
> It can be used, and is used in user code, but it isn't always as
> straightforward as you might think.  This is mostly because a Job often
> isn't a Job -- or rather it is more than one Job.  There are several RDD
> transformations that aren't lazy, so they end up launching "hidden" Jobs
> that you may not anticipate and may expect to be canceled (but won't be) by
> a cancelJob() called on a later action on that transformed RDD.  It is also
> possible for a single DataFrame or Spark SQL query to result in more than
> one running Job.  The upshot of all of this is that getting cancelJob() to
> work as most users would expect all the time is non-trivial, and most of the
> time using a jobGroup is a better way to capture what may be more than one
> Job that the user is thinking of as a single Job.
>
> On Wed, Dec 16, 2015 at 5:34 AM, Sean Owen  wrote:
>>
>> It does look like it's not actually used. It may simply be there for
>> completeness, to match cancelStage and cancelJobGroup, which are used.
>> I also don't know of a good reason there's no way to kill a whole job.
>>
>> On Wed, Dec 16, 2015 at 1:15 PM, Jacek Laskowski  wrote:
>> > Hi,
>> >
>> > While reviewing Spark code I came across SparkContext.cancelJob. I
>> > found no part of Spark using it. Is this a leftover after some
>> > refactoring? Why is this part of sc?
>> >
>> > The reason I'm asking is another question I'm having after having
>> > learnt about killing a stage in webUI. I noticed there is a way to
>> > kill/cancel stages, but no corresponding feature to kill/cancel jobs.
>> > Why? Is there a JIRA ticket to have it some day perhaps?
>> >
>> > Pozdrawiam,
>> > Jacek
>> >
>> > --
>> > Jacek Laskowski | https://medium.com/@jaceklaskowski/
>> > Mastering Apache Spark
>> > ==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
>> > Follow me at https://twitter.com/jaceklaskowski
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: java.lang.NoSuchMethodError while saving a random forest model Spark version 1.5

2015-12-16 Thread Joseph Bradley
This method is tested in the Spark 1.5 unit tests, so I'd guess it's a
problem with the Parquet dependency.  What version of Parquet are you
building Spark 1.5 off of?  (I'm not that familiar with Parquet issues
myself, but hopefully a SQL person can chime in.)

On Tue, Dec 15, 2015 at 3:23 PM, Rachana Srivastava <
rachana.srivast...@markmonitor.com> wrote:

> I have recently upgraded the Spark version, but when I try to save a random 
> forest model using the model save command I am getting a NoSuchMethodError.  My 
> code works fine with the 1.3.x version.
>
>
>
> model.save(sc.sc(), "modelsavedir");
>
>
>
>
>
> ERROR:
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation -
> Aborting job.
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 22.0 failed 1 times, most recent failure: Lost task 0.0 in stage
> 22.0 (TID 230, localhost): java.lang.NoSuchMethodError:
> parquet.schema.Types$GroupBuilder.addField(Lparquet/schema/Type;)Lparquet/schema/Types$BaseGroupBuilder;
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convertField$1.apply(CatalystSchemaConverter.scala:517)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convertField$1.apply(CatalystSchemaConverter.scala:516)
>
> at
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>
> at
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>
> at
> scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:516)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
>
> at
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>
> at
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>
> at
> org.apache.spark.sql.types.StructType.foreach(StructType.scala:92)
>
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>
> at
> org.apache.spark.sql.types.StructType.map(StructType.scala:92)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55)
>
> at
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:287)
>
> at
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:261)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetRelation.scala:94)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
>
> at
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:233)
>
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>


Spark job dying when I submit through oozie

2015-12-16 Thread Scott Gallavan
I am trying to submit a Spark job through Oozie; the job is marked
successful, but it does not do anything.

I have it working through spark-submit on the command line.

Also the problem may be around creating the spark context, because I added
logging before/after/finally creating the SparkContext, and I only see
"test 1" in the logs.

Anyone have a suggestion on debugging this?  Or where I may be able to get
additional logs?

try {
  this.log.warning("test 1")
  sc = new SparkContext(conf)
  this.log.warning("test 2")
} finally {
  this.log.warning("test 3")
}


We are using Yarn/CDH5.5

When I compare the logs, below is where they start to diverge.


==Broken===
server.AbstractConnector (AbstractConnector.java:doStart(338)) - Started
SelectChannelConnector@0.0.0.0:33827
util.Utils (Logging.scala:logInfo(59)) - Successfully started service
'SparkUI' on port x.
ui.SparkUI (Logging.scala:logInfo(59)) - Started SparkUI at
http://:
cluster.YarnClusterScheduler (Logging.scala:logInfo(59)) - Created
YarnClusterScheduler
metrics.MetricsSystem (Logging.scala:logWarning(71)) - Using default name
DAGScheduler for source because spark.app.id is not set.
util.Utils (Logging.scala:logInfo(59)) - Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 16611.
netty.NettyBlockTransferService (Logging.scala:logInfo(59)) - Server
created on x
storage.BlockManager (Logging.scala:logInfo(59)) - external shuffle service
port = 
storage.BlockManagerMaster (Logging.scala:logInfo(59)) - Trying to register
BlockManager





==Working===
AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
Utils: Successfully started service 'SparkUI' on port 4040.
SparkUI: Started SparkUI at http://xxx:4040
SparkContext: Added JAR ...
SparkContext: Added JAR file:/xxx.jar at http://xxx.jar with
timestamp 1450320527893
MetricsSystem: Using default name DAGScheduler for source because
spark.app.id is not set.
ConfiguredRMFailoverProxyProvider: Failing over to rm238
Client: Requesting a new application from cluster with 25 NodeManagers
Client: Verifying our application has not requested more than the maximum
memory capability of the cluster (65536 MB per container)


Re: PySpark Connection reset by peer: socket write error

2015-12-16 Thread Vijay Gharge
Can you elaborate on your problem further? Looking at the error, it looks like
you are running on a cluster. Also share the relevant code for better
understanding.

On Wednesday 16 December 2015, Surendran Duraisamy 
wrote:

> Hi,
>
>
>
> I am running ALS to train a data set of around 15 lines in my local
> machine. When I call train I am getting following exception.
>
>
>
> *print *mappedRDDs.count() # this prints correct RDD count
> model = ALS.train(mappedRDDs, 10, 15)
>
>
>
> 15/12/16 18:43:18 ERROR PythonRDD: Python worker exited unexpectedly
> (crashed)
>
> java.net.SocketException: Connection reset by peer: socket write error
>
> at java.net.SocketOutputStream.socketWrite0(Native Method)
>
> at
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
>
> at
> java.net.SocketOutputStream.write(SocketOutputStream.java:159)
>
> at
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>
> at
> java.io.BufferedOutputStream.write(BufferedOutputStream.java:121)
>
> at
> java.io.DataOutputStream.write(DataOutputStream.java:107)
>
> at
> java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>
> at
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:413)
>
> at
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
>
> at
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
>
> at
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
>
> at
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>
> at
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:425)
>
> at
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:248)
>
> at
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
>
> at
> org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
>
>
>
> I am getting this error only when I use MLlib with pyspark. how to resolve
> this issue?
>
>
>
> Regards,
>
> Surendran
>
>
>


-- 
Regards,
Vijay Gharge


Re: Adding a UI Servlet Filter

2015-12-16 Thread Ted Yu
Which Spark release are you using ?
Mind pasting stack trace ?

> On Dec 14, 2015, at 11:34 AM, iamknome  wrote:
> 
> Hello all,
> 
> I am trying to set up a UI filter for the Web UI and trying to add my
> custom auth servlet filter to the worker and master processes. I have
> added the extraClasspath option and have it pointed to my custom JAR but
> when the worker or master starts it keeps complaining about
> ClassNotFoundException.
> 
> What is the recommended approach to do this ? How do i protect my web ui so
> only authorized users have access to it ? How do i get past the class path
> issue ?
> 
> Any help is appreciated.
> 
> thanks
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Adding-a-UI-Servlet-Filter-tp25700.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Access row column by field name

2015-12-16 Thread Jeff Zhang
use Row.getAs[String](fieldname)
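
For example, with the df from the question below (the column name is illustrative):

df.foreach(line => println(line.getAs[String]("field_name")))

// or, to look up the position once and reuse it:
val pos = df.first().fieldIndex("field_name")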

On Thu, Dec 17, 2015 at 10:58 AM, Daniel Valdivia 
wrote:

> Hi,
>
> I'm processing the json I have in a text file using DataFrames, however
> right now I'm trying to figure out a way to access a certain value within
> the rows of my data frame if I only know the field name and not the
> respective field position in the schema.
>
> I noticed that row.schema and row.dtypes give me information about the
> auto-generate schema, but I cannot see a straigh forward patch for this,
> I'm trying to create a PairRdd out of this
>
> Is there any easy way to figure out the field position by it's field name
> (the key it had in the json)?
>
> so this
>
> val sqlContext = new SQLContext(sc)
> val rawIncRdd = sc.textFile("
> hdfs://1.2.3.4:8020/user/hadoop/incidents/unstructured/inc-0-500.txt")
>  val df = sqlContext.jsonRDD(rawIncRdd)
> df.foreach(line => println(line.getString(0)))
>
>
> would turn into something like this
>
> val sqlContext = new SQLContext(sc)
> val rawIncRdd = sc.textFile("
> hdfs://1.2.3.4:8020/user/hadoop/incidents/unstructured/inc-0-500.txt")
>  val df = sqlContext.jsonRDD(rawIncRdd)
> df.foreach(line => println(line.getString(*"field_name"*)))
>
> thanks for the advice
>



-- 
Best Regards

Jeff Zhang


Re: SparkContext.cancelJob - what part of Spark uses it? Nothing in webUI to kill jobs?

2015-12-16 Thread Sean Owen
It does look like it's not actually used. It may simply be there for
completeness, to match cancelStage and cancelJobGroup, which are used.
I also don't know of a good reason there's no way to kill a whole job.

On Wed, Dec 16, 2015 at 1:15 PM, Jacek Laskowski  wrote:
> Hi,
>
> While reviewing Spark code I came across SparkContext.cancelJob. I
> found no part of Spark using it. Is this a leftover after some
> refactoring? Why is this part of sc?
>
> The reason I'm asking is another question I'm having after having
> learnt about killing a stage in webUI. I noticed there is a way to
> kill/cancel stages, but no corresponding feature to kill/cancel jobs.
> Why? Is there a JIRA ticket to have it some day perhaps?
>
> Pozdrawiam,
> Jacek
>
> --
> Jacek Laskowski | https://medium.com/@jaceklaskowski/
> Mastering Apache Spark
> ==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: hiveContext: storing lookup of partitions

2015-12-16 Thread Jeff Zhang
Oh, you are using S3. As I remember, S3 has performance issues when
processing a large number of files.



On Wed, Dec 16, 2015 at 7:58 PM, Gourav Sengupta 
wrote:

> The HIVE table has very large number of partitions around 365 * 5 * 10 and
> when I say hivemetastore to start running queries on it (the one with
> .count() or .show()) then it takes around 2 hours before the job starts in
> SPARK.
>
> On the pyspark screen I can see that it is parsing the S3 locations for
> these 2 hours.
>
> Regards,
> Gourav
>
> On Wed, Dec 16, 2015 at 3:38 AM, Jeff Zhang  wrote:
>
>> >>> Currently it takes around 1.5 hours for me just to cache in the
>> partition information and after that I can see that the job gets queued in
>> the SPARK UI.
>> I guess you mean the stage of getting the split info. I suspect it might
>> be your cluster issue (or metadata store), unusually it won't take such
>> long time for splitting.
>>
>> On Wed, Dec 16, 2015 at 8:06 AM, Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a HIVE table with few thousand partitions (based on date and
>>> time). It takes a long time to run if for the first time and then
>>> subsequently it is fast.
>>>
>>> Is there a way to store the cache of partition lookups so that every
>>> time I start a new SPARK instance (cannot keep my personal server running
>>> continuously), I can immediately restore back the temptable in hiveContext
>>> without asking it go again and cache the partition lookups?
>>>
>>> Currently it takes around 1.5 hours for me just to cache in the
>>> partition information and after that I can see that the job gets queued in
>>> the SPARK UI.
>>>
>>> Regards,
>>> Gourav
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>


-- 
Best Regards

Jeff Zhang


Re: PairRDD(K, L) to multiple files by key serializing each value in L before

2015-12-16 Thread Abhishek Shivkumar
Hello Daniel,

  I was thinking you could write something like this:

def create_and_write_file(line):
    import json                               # make json available on the executors
    key, values = line                        # line[0] is the key, line[1] the list of values
    with open("cat-%s.txt" % key, "w") as f:  # open a file named after the key
        for ele in values:                    # iterate through the values of this key
            f.write(json.dumps(ele) + "\n")   # write every element; the with-block closes the file

catGroupArr.foreach(create_and_write_file)    # foreach triggers it; note the files are written on the executors

Do you think this works?

Thanks
Abhishek S


Thank you!

With Regards,
Abhishek S

On Wed, Dec 16, 2015 at 1:05 AM, Daniel Valdivia 
wrote:

> Hello everyone,
>
> I have a PairRDD with a set of key and list of values, each value in the
> list is a json which I already loaded beginning of my spark app, how can I
> iterate over each value of the list in my pair RDD to transform it to a
> string then save the whole content of the key to a file? one file per key
>
> my input files look like cat-0-500.txt:
>
> *{cat:'red',value:'asd'}*
> *{cat:'green',value:'zxc'}*
> *{cat:'red',value:'jkl'}*
>
> The PairRDD looks like
>
> *('red', [{cat:'red',value:'asd'},{cat:'red',value:'jkl'}])*
> *('green', [{cat:'green',value:'zxc'}])*
>
> so as you can see I I'd like to serialize each json in the value list back
> to string so I can easily saveAsTextFile(), ofcourse I'm trying to save a
> separate file for each key
>
> The way I got here:
>
> *rawcatRdd = sc.textFile("hdfs://x.x.x.../unstructured/cat-0-500.txt")*
> *import json*
> *categoriesJson = rawcatRdd.map(lambda x: json.loads(x))*
> *categories = categoriesJson*
>
> *catByDate = categories.map(lambda x: (x['cat'], x)*
> *catGroup = catByDate.groupByKey()*
> *catGroupArr = catGroup.mapValues(lambda x : list(x))*
>
> Ideally I want to create a cat-red.txt that looks like:
>
> {cat:'red',value:'asd'}
> {cat:'red',value:'jkl'}
>
> and the same for the rest of the keys.
>
> I already looked at this answer
> 
>  but
> I'm slightly lost as host to process each value in the list (turn into
> string) before I save the contents to a file, also I cannot figure out how
> to import *MultipleTextOutputFormat* in python either.
>
> I'm trying all this wacky stuff in the pyspark shell
>
> Any advice would be greatly appreciated
>
> Thanks in advance!
>


Re: hiveContext: storing lookup of partitions

2015-12-16 Thread Gourav Sengupta
Hi Jeff,

Sadly that does not resolve the issue. I am sure that the in-memory mapping of
partitions to physical file locations could be saved and restored in Spark.


Regards,
Gourav Sengupta

On Wed, Dec 16, 2015 at 12:13 PM, Jeff Zhang  wrote:

> oh, you are using S3. As I remember,  S3 has performance issue when
> processing large amount of files.
>
>
>
> On Wed, Dec 16, 2015 at 7:58 PM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> The HIVE table has very large number of partitions around 365 * 5 * 10
>> and when I say hivemetastore to start running queries on it (the one with
>> .count() or .show()) then it takes around 2 hours before the job starts in
>> SPARK.
>>
>> On the pyspark screen I can see that it is parsing the S3 locations for
>> these 2 hours.
>>
>> Regards,
>> Gourav
>>
>> On Wed, Dec 16, 2015 at 3:38 AM, Jeff Zhang  wrote:
>>
>>> >>> Currently it takes around 1.5 hours for me just to cache in the
>>> partition information and after that I can see that the job gets queued in
>>> the SPARK UI.
>>> I guess you mean the stage of getting the split info. I suspect it might
>>> be your cluster issue (or metadata store), unusually it won't take such
>>> long time for splitting.
>>>
>>> On Wed, Dec 16, 2015 at 8:06 AM, Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
 Hi,

 I have a HIVE table with few thousand partitions (based on date and
 time). It takes a long time to run if for the first time and then
 subsequently it is fast.

 Is there a way to store the cache of partition lookups so that every
 time I start a new SPARK instance (cannot keep my personal server running
 continuously), I can immediately restore back the temptable in hiveContext
 without asking it go again and cache the partition lookups?

 Currently it takes around 1.5 hours for me just to cache in the
 partition information and after that I can see that the job gets queued in
 the SPARK UI.

 Regards,
 Gourav

>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


PySpark Connection reset by peer: socket write error

2015-12-16 Thread Surendran Duraisamy
Hi,



I am running ALS to train a data set of around 15 lines in my local
machine. When I call train I am getting following exception.



*print *mappedRDDs.count() # this prints correct RDD count
model = ALS.train(mappedRDDs, 10, 15)



15/12/16 18:43:18 ERROR PythonRDD: Python worker exited unexpectedly
(crashed)

java.net.SocketException: Connection reset by peer: socket write error

at java.net.SocketOutputStream.socketWrite0(Native Method)

at
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)

at
java.net.SocketOutputStream.write(SocketOutputStream.java:159)

at
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)

at
java.io.BufferedOutputStream.write(BufferedOutputStream.java:121)

at java.io.DataOutputStream.write(DataOutputStream.java:107)

at
java.io.FilterOutputStream.write(FilterOutputStream.java:97)

at
org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:413)

at
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)

at
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)

at
scala.collection.Iterator$class.foreach(Iterator.scala:727)

at
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)

at
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:425)

at
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:248)

at
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)

at
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)



I am getting this error only when I use MLlib with pyspark. how to resolve
this issue?



Regards,

Surendran


Re: ideal number of executors per machine

2015-12-16 Thread Sean Owen
I don't think it has anything to do with using all the cores, since 1
executor can run as many tasks as you like. Yes, you'd want them to
request all cores in this case. YARN vs Mesos does not matter here.

On Wed, Dec 16, 2015 at 1:58 PM, Michael Segel
 wrote:
> Hmmm.
> This would go against the grain.
>
> I have to ask how you came to that conclusion…
>
> There are a lot of factors…  e.g. Yarn vs Mesos?
>
> What you’re suggesting would mean a loss of parallelism.
>
>
>> On Dec 16, 2015, at 12:22 AM, Sean Owen  wrote:
>>
>> 1 per machine is the right number. If you are running very large heaps
>> (>64GB) you may consider multiple per machine just to make sure each's
>> GC pauses aren't excessive, but even this might be better mitigated
>> with GC tuning.
>>
>> On Tue, Dec 15, 2015 at 9:07 PM, Veljko Skarich
>>  wrote:
>>> Hi,
>>>
>>> I'm looking for suggestions on the ideal number of executors per machine. I
>>> run my jobs on 64G 32 core machines, and at the moment I have one executor
>>> running per machine, on the spark standalone cluster.
>>>
>>> I could not find many guidelines for figuring out the ideal number of
>>> executors; the Spark official documentation merely recommends not having
>>> more than 64G per executor to avoid GC issues. Anyone have and advice on
>>> this?
>>>
>>> thank you.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming: How to specify deploy mode through configuration parameter?

2015-12-16 Thread Ted Yu
This seems related:
[SPARK-10123][DEPLOY] Support specifying deploy mode from configuration

FYI

On Wed, Dec 16, 2015 at 7:31 AM, Saiph Kappa  wrote:

> Hi,
>
> I have a client application running on host0 that is launching multiple
> drivers on multiple remote standalone spark clusters (each cluster is
> running on a single machine):
>
> «
> ...
>
> List("host1", "host2" , "host3").foreach(host => {
>
> val sparkConf = new SparkConf()
> sparkConf.setAppName("App")
>
> sparkConf.set("spark.driver.memory", "4g")
> sparkConf.set("spark.executor.memory", "4g")
> sparkConf.set("spark.driver.maxResultSize", "4g")
> sparkConf.set("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
> sparkConf.set("spark.executor.extraJavaOptions", " -XX:+UseCompressedOops 
> -XX:+UseConcMarkSweepGC " +
>   "-XX:+AggressiveOpts -XX:FreqInlineSize=300 -XX:MaxInlineSize=300 ")
>
> sparkConf.setMaster(s"spark://$host:7077")
>
> val rawStreams = (1 to source.parallelism).map(_ => 
> ssc.textFileStream("/home/user/data/")).toArray
> val rawStream = ssc.union(rawStreams)
> rawStream.count.map(c => s"Received $c records.").print()
>
> }
> ...
>
> »
>
> The problem is that I'm getting an error message saying that the directory 
> "/home/user/data/" does not exist.
> In fact, this directory only exists in host1, host2 and host3 and not in 
> host0.
> But since I'm launching the driver to host1..3 I thought data would be 
> fetched from those machines.
>
> I'm also trying to avoid using the spark submit script, and couldn't find the 
> configuration parameter to specify the deploy mode.
>
> Is there any way to specify the deploy mode through configuration parameter?
>
>
> Thanks.
>
>


Trying to index document in Solr with Spark and solr-spark library

2015-12-16 Thread Guillermo Ortiz
I'm trying to index documents to Solr from Spark with the solr-spark library.

I have created a project with Maven and included all the dependencies when I
execute Spark, but I get a ClassNotFoundException. I have checked that the
class is in one of the jars that I'm including (solr-solrj-4.10.3.jar).
I compiled the 4.x branch of this library because the latest one targets Solr
5.x.

The error seems pretty clear, but I have checked spark.jars in the Spark UI
and the jars appear.

2015-12-16 16:13:40,265 [task-result-getter-3] WARN
 org.apache.spark.ThrowableSerializationWrapper - Task exception could not
be deserialized
java.lang.*ClassNotFoundException:
org.apache.solr.common.cloud.ZooKeeperException*
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.ThrowableSerializationWrapper.readObject(TaskEndReason.scala:167)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply$mcV$sp(TaskResultGetter.scala:108)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply(TaskResultGetter.scala:105)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply(TaskResultGetter.scala:105)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:105)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


Spark Streaming: How to specify deploy mode through configuration parameter?

2015-12-16 Thread Saiph Kappa
Hi,

I have a client application running on host0 that is launching multiple
drivers on multiple remote standalone spark clusters (each cluster is
running on a single machine):

«
...

List("host1", "host2" , "host3").foreach(host => {

val sparkConf = new SparkConf()
sparkConf.setAppName("App")

sparkConf.set("spark.driver.memory", "4g")
sparkConf.set("spark.executor.memory", "4g")
sparkConf.set("spark.driver.maxResultSize", "4g")
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.executor.extraJavaOptions", "
-XX:+UseCompressedOops -XX:+UseConcMarkSweepGC " +
  "-XX:+AggressiveOpts -XX:FreqInlineSize=300 -XX:MaxInlineSize=300 ")

sparkConf.setMaster(s"spark://$host:7077")

val rawStreams = (1 to source.parallelism).map(_ =>
ssc.textFileStream("/home/user/data/")).toArray
val rawStream = ssc.union(rawStreams)
rawStream.count.map(c => s"Received $c records.").print()

}
...

»

The problem is that I'm getting an error message saying that the
directory "/home/user/data/" does not exist.
In fact, this directory only exists in host1, host2 and host3 and not in host0.
But since I'm launching the driver to host1..3 I thought data would be
fetched from those machines.

I'm also trying to avoid using the spark submit script, and couldn't
find the configuration parameter to specify the deploy mode.

Is there any way to specify the deploy mode through configuration parameter?


Thanks.


Re: Hive error after update from 1.4.1 to 1.5.2

2015-12-16 Thread Bryan Jeffrey
I had a bunch of library dependencies that were still using Scala 2.10
versions. I updated them to 2.11 and everything has worked fine since.
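
For anyone hitting the same thing, a minimal build.sbt sketch of what that looks like
(versions are illustrative):

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  // %% resolves the _2.11 artifacts, so every Spark module stays on the same Scala version
  "org.apache.spark" %% "spark-core" % "1.5.2" % "provided",
  "org.apache.spark" %% "spark-hive" % "1.5.2" % "provided"
)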

On Wed, Dec 16, 2015 at 3:12 AM, Ashwin Sai Shankar 
wrote:

> Hi Bryan,
> I see the same issue with 1.5.2,  can you pls let me know what was the
> resolution?
>
> Thanks,
> Ashwin
>
> On Fri, Nov 20, 2015 at 12:07 PM, Bryan Jeffrey 
> wrote:
>
>> Nevermind. I had a library dependency that still had the old Spark
>> version.
>>
>> On Fri, Nov 20, 2015 at 2:14 PM, Bryan Jeffrey 
>> wrote:
>>
>>> The 1.5.2 Spark was compiled using the following options:  mvn
>>> -Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Phive
>>> -Phive-thriftserver clean package
>>>
>>> Regards,
>>>
>>> Bryan Jeffrey
>>>
>>> On Fri, Nov 20, 2015 at 2:13 PM, Bryan Jeffrey 
>>> wrote:
>>>
 Hello.

 I'm seeing an error creating a Hive Context moving from Spark 1.4.1 to
 1.5.2.  Has anyone seen this issue?

 I'm invoking the following:

 new HiveContext(sc) // sc is a Spark Context

 I am seeing the following error:

 SLF4J: Class path contains multiple SLF4J bindings.
 SLF4J: Found binding in
 [jar:file:/spark/spark-1.5.2/assembly/target/scala-2.11/spark-assembly-1.5.2-hadoop2.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in
 [jar:file:/hadoop-2.6.1/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
 explanation.
 SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
 Exception in thread "main" java.lang.NoSuchMethodException:
 org.apache.hadoop.hive.conf.HiveConf.getTimeVar(org.apache.hadoop.hive.conf.HiveConf$ConfVars,
 java.util.concurrent.TimeUnit)
 at java.lang.Class.getMethod(Class.java:1786)
 at
 org.apache.spark.sql.hive.client.Shim.findMethod(HiveShim.scala:114)
 at
 org.apache.spark.sql.hive.client.Shim_v0_14.getTimeVarMethod$lzycompute(HiveShim.scala:415)
 at
 org.apache.spark.sql.hive.client.Shim_v0_14.getTimeVarMethod(HiveShim.scala:414)
 at
 org.apache.spark.sql.hive.client.Shim_v0_14.getMetastoreClientConnectRetryDelayMillis(HiveShim.scala:459)
 at
 org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:198)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at
 java.lang.reflect.Constructor.newInstance(Constructor.java:422)
 at
 org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:183)
 at
 org.apache.spark.sql.hive.client.IsolatedClientLoader.<init>(IsolatedClientLoader.scala:179)
 at
 org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:226)
 at
 org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
 at
 org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:392)
 at
 org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:174)
 at
 org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:177)
 at
 Main.Factories.HiveContextSingleton$.createHiveContext(HiveContextSingleton.scala:21)
 at
 Main.Factories.HiveContextSingleton$.getHiveContext(HiveContextSingleton.scala:14)
 at
 Main.Factories.SparkStreamingContextFactory$.createSparkContext(SparkStreamingContextFactory.scala:35)
 at Main.WriteModel$.main(WriteModel.scala:16)
 at Main.WriteModel.main(WriteModel.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
 at
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
 at
 org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
 at
 org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

>>>
>>>
>>
>


WholeTextFile for 8000~ files - problem

2015-12-16 Thread Eran Witkon
Hi,
I have about 8K files in about 10 directories on HDFS, and I need to add a
column to every file with the file name (e.g. file1.txt adds a column with
"file1.txt", file2.txt with "file2.txt", etc.).

The current approach was to read all files using *sc.WholeTextFiles("myPath")
*and have the file name as key and add it as coulmn to each file.

1) I run this on 5 servers, each with 24 cores and 24GB RAM, with a config of:
*spark-shell --master yarn-client --executor-cores 5 --executor-memory 5G*
But when I run this on all directories at once
(sc.wholeTextFiles("/MySource/*/*")) I am getting *java.lang.OutOfMemoryError:
Java heap space*.
When running on a single directory, e.g.
*sc.wholeTextFiles("/MySource/dir1/*")*, all works well.

2) One other option is not to use wholeTextFiles but to read each line with
sc.textFile, but how can I get the file name with textFile?

Eran
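
A minimal sketch in Scala of one way to address both points, assuming the files are line-oriented text and a comma-separated column is acceptable (the directory list and helper name below are illustrative): read one directory at a time with wholeTextFiles, which already provides the file name as the key, and append it to every line. Note that wholeTextFiles still reads each file as a single record, so very large individual files remain a memory concern.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def linesWithFileName(sc: SparkContext, dirs: Seq[String]): RDD[String] = {
  val perDir = dirs.map { dir =>
    sc.wholeTextFiles(dir + "/*")                         // (fileName, fileContent) pairs
      .flatMap { case (file, content) =>
        content.split("\n").map(line => s"$line,$file")   // append the file name as a column
      }
  }
  sc.union(perDir)                                        // combine the per-directory RDDs
}

// hypothetical usage:
// val withNames = linesWithFileName(sc, Seq("/MySource/dir1", "/MySource/dir2"))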


Re: Preventing an RDD from shuffling

2015-12-16 Thread PhuDuc Nguyen
There is a way, and it's called a "map-side join". To be clear, there is no
explicit function call/API to execute a map-side join. You have to code it
using a local/broadcast value combined with the map() function. A caveat
for this to work is that one side of the join must be small-ish, so that it can
exist as a local/broadcast value. What you're trying to achieve is essentially a
partition-local join via the map function. The results are equivalent to
a join but avoid a cluster-wide shuffle.

Read the pdf below and look for "Example: Join". This will explain how
joins work in Spark and how you can try to optimize it with a map-side-join
(if your use case fits).
http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf

HTH,
Duc
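
A minimal sketch of such a map-side join in Scala, assuming the small side fits comfortably in memory (smallRdd, largeRdd and the key/value types are illustrative):

// Collect the small side to the driver and broadcast it to every executor.
val smallSide: Map[String, String] = smallRdd.collectAsMap().toMap
val smallBc = sc.broadcast(smallSide)

// Join inside each partition of the large side; no cluster-wide shuffle is triggered.
val joined = largeRdd.mapPartitions { iter =>
  val lookup = smallBc.value
  iter.flatMap { case (k, v) =>
    lookup.get(k).map(w => (k, (v, w)))   // keep only matching keys (inner-join semantics)
  }
}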



On Wed, Dec 16, 2015 at 3:23 AM, sparkuser2345 
wrote:

> Is there a way to prevent an RDD from shuffling in a join operation without
> repartitioning it?
>
> I'm reading an RDD from sharded MongoDB, joining that with an RDD of
> incoming data (+ some additional calculations), and writing the resulting
> RDD back to MongoDB. It would make sense to shuffle only the incoming data
> RDD so that the joined RDD would already be partitioned correctly according
> to the MongoDB shard key.
>
> I know I can prevent an RDD from shuffling in a join operation by
> partitioning it beforehand but partitioning would already shuffle the RDD.
> In addition, I'm only doing the join once per RDD read from MongoDB. Is
> there a way to tell Spark to shuffle only the incoming data RDD?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Preventing-an-RDD-from-shuffling-tp25717.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: hiveContext: storing lookup of partitions

2015-12-16 Thread Gourav Sengupta
The HIVE table has a very large number of partitions (around 365 * 5 * 10), and
when I ask the hivemetastore to start running queries on it (the ones with
.count() or .show()), it takes around 2 hours before the job starts in
SPARK.

On the pyspark screen I can see that it is parsing the S3 locations for
these 2 hours.

Regards,
Gourav

On Wed, Dec 16, 2015 at 3:38 AM, Jeff Zhang  wrote:

> >>> Currently it takes around 1.5 hours for me just to cache in the
> partition information and after that I can see that the job gets queued in
> the SPARK UI.
> I guess you mean the stage of getting the split info. I suspect it might
> be a cluster issue (or the metadata store); usually it won't take such a
> long time for splitting.
>
> On Wed, Dec 16, 2015 at 8:06 AM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a HIVE table with a few thousand partitions (based on date and
>> time). It takes a long time to run the first time, and then
>> subsequently it is fast.
>>
>> Is there a way to store the cache of partition lookups so that every time
>> I start a new SPARK instance (I cannot keep my personal server running
>> continuously), I can immediately restore the temptable in hiveContext
>> without asking it to go and cache the partition lookups again?
>>
>> Currently it takes around 1.5 hours for me just to cache in the partition
>> information and after that I can see that the job gets queued in the SPARK
>> UI.
>>
>> Regards,
>> Gourav
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


SparkContext.cancelJob - what part of Spark uses it? Nothing in webUI to kill jobs?

2015-12-16 Thread Jacek Laskowski
Hi,

While reviewing Spark code I came across SparkContext.cancelJob. I
found no part of Spark using it. Is this a leftover after some
refactoring? Why is this part of sc?

The reason I'm asking is another question that came up after I
learnt about killing a stage in the webUI. I noticed there is a way to
kill/cancel stages, but no corresponding feature to kill/cancel jobs.
Why? Is there a JIRA ticket to have it some day, perhaps?

Pozdrawiam,
Jacek

--
Jacek Laskowski | https://medium.com/@jaceklaskowski/
Mastering Apache Spark
==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Trying to index document in Solr with Spark and solr-spark library

2015-12-16 Thread Guillermo Ortiz
I have found a more specific error in an executor. The other error is
happening in the driver.

I have navigated in ZooKeeper and the collection is created. Anyway, it seems
more like a problem with Solr than with Spark right now.

2015-12-16 16:31:43,923 [Executor task launch worker-1] INFO
org.apache.zookeeper.ZooKeeper - Session: 0x1519126c7d55b23 closed
2015-12-16 16:31:43,924 [Executor task launch worker-1] ERROR
org.apache.spark.executor.Executor - Exception in task 5.2 in stage
12.0 (TID 218)
org.apache.solr.common.cloud.ZooKeeperException:
at 
org.apache.solr.client.solrj.impl.CloudSolrServer.connect(CloudSolrServer.java:252)
at com.lucidworks.spark.SolrSupport.getSolrServer(SolrSupport.java:67)
at com.lucidworks.spark.SolrSupport$4.call(SolrSupport.java:162)
at com.lucidworks.spark.SolrSupport$4.call(SolrSupport.java:160)
at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:222)
at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:222)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
KeeperErrorCode = NoNode for /live_nodes
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
at 
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:290)
at 
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:287)
at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:74)
at 
org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:287)
at 
org.apache.solr.common.cloud.ZkStateReader.createClusterStateWatchersAndUpdate(ZkStateReader.java:334)
at 
org.apache.solr.client.solrj.impl.CloudSolrServer.connect(CloudSolrServer.java:243)


2015-12-16 16:26 GMT+01:00 Guillermo Ortiz :

> I'm trying to index documents to Solr from Spark with the solr-spark library.
>
> I have created a project with Maven and included all the dependencies when I
> execute Spark, but I get a ClassNotFoundException. I have checked that the
> class is in one of the jars that I'm including (solr-solrj-4.10.3.jar).
> I compiled the branch 4.X of this library because the latest one is for Solr
> 5.X.
>
> The error seems pretty clear, but I have checked spark.jars in the sparkUI
> and the jars appear.
>
> 2015-12-16 16:13:40,265 [task-result-getter-3] WARN
>  org.apache.spark.ThrowableSerializationWrapper - Task exception could not
> be deserialized
> java.lang.*ClassNotFoundException:
> org.apache.solr.common.cloud.ZooKeeperException*
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:270)
> at
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at
> org.apache.spark.ThrowableSerializationWrapper.readObject(TaskEndReason.scala:167)
> at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 

Re: SparkContext.cancelJob - what part of Spark uses it? Nothing in webUI to kill jobs?

2015-12-16 Thread Mark Hamstra
It can be used, and is used in user code, but it isn't always as
straightforward as you might think.  This is mostly because a Job often
isn't a Job -- or rather it is more than one Job.  There are several RDD
transformations that aren't lazy, so they end up launching "hidden" Jobs
that you may not anticipate and may expect to be canceled (but won't be) by
a cancelJob() called on a later action on that transformed RDD.  It is also
possible for a single DataFrame or Spark SQL query to result in more than
one running Job.  The upshot of all of this is that getting cancelJob() to
work as most users would expect all the time is non-trivial, and most of
the time using a jobGroup is a better way to capture what may be more than
one Job that the user is thinking of as a single Job.
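
A minimal sketch of the job-group approach in Scala (the group id, description, rdd and expensiveTransform below are illustrative):

// In the thread that submits the work: tag everything submitted afterwards with a group id.
sc.setJobGroup("nightly-report", "all jobs for the nightly report", interruptOnCancel = true)
val result = rdd.map(expensiveTransform).count()   // may launch more than one job under the hood

// From another thread (e.g. a timeout watchdog), cancel every job in the group at once:
sc.cancelJobGroup("nightly-report")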

On Wed, Dec 16, 2015 at 5:34 AM, Sean Owen  wrote:

> It does look like it's not actually used. It may simply be there for
> completeness, to match cancelStage and cancelJobGroup, which are used.
> I also don't know of a good reason there's no way to kill a whole job.
>
> On Wed, Dec 16, 2015 at 1:15 PM, Jacek Laskowski  wrote:
> > Hi,
> >
> > While reviewing Spark code I came across SparkContext.cancelJob. I
> > found no part of Spark using it. Is this a leftover after some
> > refactoring? Why is this part of sc?
> >
> > The reason I'm asking is another question that came up after I
> > learnt about killing a stage in the webUI. I noticed there is a way to
> > kill/cancel stages, but no corresponding feature to kill/cancel jobs.
> > Why? Is there a JIRA ticket to have it some day, perhaps?
> >
> > Pozdrawiam,
> > Jacek
> >
> > --
> > Jacek Laskowski | https://medium.com/@jaceklaskowski/
> > Mastering Apache Spark
> > ==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> > Follow me at https://twitter.com/jaceklaskowski
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark-shell connecting to Mesos stuck at sched.cpp

2015-12-16 Thread Aaron
Found this thread that talked about it to help understand it better:

https://mail-archives.apache.org/mod_mbox/mesos-user/201507.mbox/%3ccajq68qf9pejgnwomasm2dqchyaxpcaovnfkfgggxxpzj2jo...@mail.gmail.com%3E

>
> When you run Spark on Mesos it needs to run
>
> spark driver
> mesos scheduler
>
> and both need to be visible to outside world on public iface IP
>
> you need to tell Spark and Mesos on which interface to bind - by default
> they resolve node hostname to ip - this is loopback address in your case
>
> Possible solutions - on slave node with public IP 192.168.56.50
>
> 1. Set
>
>export LIBPROCESS_IP=192.168.56.50
>export SPARK_LOCAL_IP=192.168.56.50
>
> 2. Ensure your hostname resolves to public iface IP - (for testing) edit
> /etc/hosts to resolve your domain name to 192.168.56.50
> 3. Set correct hostname/ip in mesos configuration - see Nikolaos answer
>

Cheers,
Aaron


On Mon, Nov 16, 2015 at 9:37 PM, Jo Voordeckers
 wrote:
> I've seen this issue when the Mesos cluster couldn't figure out my IP
> address correctly. Have you tried setting the ENV var with your IP address
> when launching Spark or the Mesos cluster dispatcher, like:
>
>  LIBPROCESS_IP="172.16.0.180"
>
>
> - Jo Voordeckers
>
>
> On Sun, Nov 15, 2015 at 6:59 PM, Jong Wook Kim  wrote:
>>
>> I'm having a problem connecting my spark app to a Mesos cluster; any help on
>> the below question would be appreciated.
>>
>>
>> http://stackoverflow.com/questions/33727154/spark-shell-connecting-to-mesos-stuck-at-sched-cpp
>>
>> Thanks,
>> Jong Wook
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Using Spark to process JSON with gzip filed

2015-12-16 Thread Eran Witkon
Hi,
I have a few JSON files in which one of the fields is a binary field - this
field is the output of running GZIP on a JSON stream and compressing it into
the binary field.

Now I want to decompress the field and get the output JSON.
I was thinking of running a map operation and passing a function to the map
operation which will decompress each JSON file.
The above function will find the right field in the outer JSON and then run
GUNZIP on it.

1) Is this a valid practice for a Spark map job?
2) Any pointers on how to do that?

Eran
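
For point 2), a minimal sketch in Scala, assuming the compressed field is base64-encoded and named "payload" (both assumptions), and using a rough regex where a real job would use a proper JSON parser:

import java.io.ByteArrayInputStream
import java.util.Base64  // Java 8; on Java 7, commons-codec's Base64 works as well
import java.util.zip.GZIPInputStream
import scala.io.Source

// Decompress one base64-encoded, gzipped string back into text.
def gunzip(b64: String): String = {
  val in = new GZIPInputStream(new ByteArrayInputStream(Base64.getDecoder.decode(b64)))
  try Source.fromInputStream(in, "UTF-8").mkString finally in.close()
}

// Very rough extraction of a string field from the outer JSON, good enough for a sketch.
def extractField(json: String, field: String): Option[String] =
  ("\"" + field + "\"\\s*:\\s*\"([^\"]+)\"").r.findFirstMatchIn(json).map(_.group(1))

val outerJson = sc.textFile("/path/to/json")           // hypothetical input path, one JSON doc per line
val innerJson = outerJson.flatMap { record =>
  extractField(record, "payload").map(gunzip)          // a plain map/flatMap is a fine place to gunzip
}

So yes, doing the decompression inside map (or flatMap) is a valid per-record transformation; the main thing to watch is that each decompressed document has to fit in executor memory.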


SparkEx PiAverage: Re: How to meet nested loop on pairRdd?

2015-12-16 Thread MegaLearn
I am wondering about the same concept as the OP; did anyone have an answer
for this question? I can't see that Spark has loops built in, except to loop
over a dataset of existing/known size. Thus I often create a "dummy"
ArrayList and pass it to parallelize to control how many times Spark will
run the function I supply.

// setup "dummy" ArrayList of size HOW_MANY_DARTS -- how many darts to throw
List throwsList = new ArrayList(HOW_MANY_DARTS);

JavaRDD dataSet = jsc.parallelize(throwsList, SLICES);

Then if I want to do this nested loop the OP was asking about, it seems like
I am going to nest parallelize calls, with the outer parallelize being
passed an ArrayList of length i and then inner parallelize getting a
different ArrayList of length j?

Is there not a more direct way to tell Spark to do something a fixed number
of times? It seems like it's geared to datasets of existing length, like
"iterate over the complete works of Shakespeare" which already has content
of a specific length.

In my case, for the simplest example, I am trying to run the PiAverage Spark
example, which "throws darts" at a circle j times. Then I want to run that
PiAverage i times and average the results to see if they really converge on
pi.

I would be happy to post more code, but it seems like a more conceptual
question I am asking. Thanks!
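
For the narrower question of running something a fixed number of times, a common pattern is to parallelize a range instead of a dummy ArrayList; the sketch below is in Scala for brevity (the Java API is analogous), and the constants are illustrative:

val numTrials     = 100        // the outer "i" loop: how many PiAverage runs
val dartsPerTrial = 1000000    // the inner "j" loop: darts thrown inside a single task

// One RDD element per trial; each task runs a whole trial locally, so no nested parallelize is needed.
val estimates = sc.parallelize(1 to numTrials, numTrials).map { _ =>
  val hits = (1 to dartsPerTrial).count { _ =>
    val (x, y) = (math.random * 2 - 1, math.random * 2 - 1)
    x * x + y * y <= 1.0
  }
  4.0 * hits / dartsPerTrial   // one Pi estimate
}

val piAverage = estimates.mean()   // average over the numTrials estimates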



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-meet-nested-loop-on-pairRdd-tp21121p25718.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Error while running a job in yarn-client mode

2015-12-16 Thread sunil m
Hello Spark experts!


I am using Spark 1.5.1 and get the following exception while running
sample applications.

Any tips/hints on how to solve the error below will be of great help!


_
*Exception in thread "main" java.lang.IllegalStateException: Cannot call
methods on a stopped SparkContext*
at org.apache.spark.SparkContext.org
$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
at
org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:2052)
at
org.apache.spark.api.java.JavaSparkContext.parallelizePairs(JavaSparkContext.scala:169)
at
com.slb.sis.bolt.test.samples.SparkAPIExamples.mapsFromPairsToPairs(SparkAPIExamples.java:156)
at
com.slb.sis.bolt.test.samples.SparkAPIExamples.main(SparkAPIExamples.java:33)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Thanks in advance.

Warm regards,
Sunil M.


Re: looking for a easier way to count the number of items in a JavaDStream

2015-12-16 Thread Bryan Cutler
Hi Andy,

Regarding the foreachrdd return value, this Jira that will be in 1.6 should
take care of that https://issues.apache.org/jira/browse/SPARK-4557 and make
things a little simpler.
On Dec 15, 2015 6:55 PM, "Andy Davidson" 
wrote:

> I am writing  a JUnit test for some simple streaming code. I want to make
> assertions about how many things are in a given JavaDStream. I wonder if there
> is an easier way in Java to get the count?
>
> I think there are two points of friction.
>
>
>    1. It is easy to create an accumulator of type double or int; however,
>    Long is not supported.
>    2. We need to use javaDStream.foreachRDD. The Function interface must
>    return void. I was not able to define an accumulator in my driver and
>    use a lambda function. (I am new to lambdas in Java.)
>
> Here is a little lambda example that logs my test objects. I was not able
> to figure out how to get it to return a value or access an accumulator.
>
>data.foreachRDD(rdd -> {
>
> logger.info(“Begin data.foreachRDD" );
>
> for (MyPojo pojo : rdd.collect()) {
>
> logger.info("\n{}", pojo.toString());
>
> }
>
> return null;
>
> });
>
>
> Any suggestions would be greatly appreciated
>
> Andy
>
> The following code works in my driver but is a lot of code for such a
> trivial computation. Because it needs the JavaSparkContext I do not
> think it would work inside a closure. I assume the workers do not have access
> to the context as a global, and that shipping it in the closure is not a
> good idea?
>
> public class JavaDStreamCount implements Serializable {
>
> private static final long serialVersionUID = -3600586183332429887L;
>
> public static Logger logger =
> LoggerFactory.getLogger(JavaDStreamCount.class);
>
>
>
> public Double hack(JavaSparkContext sc, JavaDStream javaDStream) {
>
> Count c = new Count(sc);
>
> javaDStream.foreachRDD(c);
>
> return c.getTotal().value();
>
> }
>
>
>
> class Count implements Function {
>
> private static final long serialVersionUID =
> -5239727633710162488L;
>
> Accumulator total;
>
>
>
> public Count(JavaSparkContext sc) {
>
> total = sc.accumulator(0.0);
>
> }
>
>
>
> @Override
>
> public java.lang.Void call(JavaRDD rdd) throws Exception {
>
> List data = rdd.collect();
>
> int dataSize = data.size();
>
> logger.error("data.size:{}", dataSize);
>
> long num = rdd.count();
>
> logger.error("num:{}", num);
>
> total.add(new Double(num));
>
> return null;
>
> }
>
>
> public Accumulator getTotal() {
>
> return total;
>
> }
>
> }
>
> }
>
>
>
>
>


Re: Compiling spark 1.5.1 fails with scala.reflect.internal.Types$TypeError: bad symbolic reference.

2015-12-16 Thread Simon Hafner
It happens with 2.11; you'll have to do both:

./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package

You get that error if you forget one, IIRC.

2015-12-05 20:17 GMT+08:00 MrAsanjar . :
> Simon, I am getting the same error. How did you resolve the problem?
>
> On Fri, Oct 16, 2015 at 9:54 AM, Simon Hafner  wrote:
>>
>> Fresh clone of spark 1.5.1, java version "1.7.0_85"
>>
>> build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
>> package
>>
>> [error] bad symbolic reference. A signature in WebUI.class refers to
>> term eclipse
>> [error] in package org which is not available.
>> [error] It may be completely missing from the current classpath, or
>> the version on
>> [error] the classpath might be incompatible with the version used when
>> compiling WebUI.class.
>> [error] bad symbolic reference. A signature in WebUI.class refers to term
>> jetty
>> [error] in value org.eclipse which is not available.
>> [error] It may be completely missing from the current classpath, or
>> the version on
>> [error] the classpath might be incompatible with the version used when
>> compiling WebUI.class.
>> [error]
>> [error]  while compiling:
>> /root/spark/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
>> [error] during phase: erasure
>> [error]  library version: version 2.10.4
>> [error] compiler version: version 2.10.4
>> [error]   reconstructed args: -deprecation -classpath
>>
>> /root/spark/sql/core/target/scala-2.10/classes:/root/.m2/repository/org/apache/spark/spark-core_2.10/1.6.0-SNAPSHOT/spark-core_2.10-1.6.0-SNAPSHOT.jar:/root/.m
>>
>> 2/repository/org/apache/avro/avro-mapred/1.7.7/avro-mapred-1.7.7-hadoop2.jar:/root/.m2/repository/org/apache/avro/avro-ipc/1.7.7/avro-ipc-1.7.7.jar:/root/.m2/repository/org/apache/avro/avro-ipc/1.7.7/avro-ipc-1.7.
>>
>> 7-tests.jar:/root/.m2/repository/com/twitter/chill_2.10/0.5.0/chill_2.10-0.5.0.jar:/root/.m2/repository/com/esotericsoftware/kryo/kryo/2.21/kryo-2.21.jar:/root/.m2/repository/com/esotericsoftware/reflectasm/reflec
>>
>> tasm/1.07/reflectasm-1.07-shaded.jar:/root/.m2/repository/com/esotericsoftware/minlog/minlog/1.2/minlog-1.2.jar:/root/.m2/repository/com/twitter/chill-java/0.5.0/chill-java-0.5.0.jar:/root/.m2/repository/org/apach
>>
>> e/hadoop/hadoop-client/2.4.0/hadoop-client-2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-common/2.4.0/hadoop-common-2.4.0.jar:/root/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar:/root/.m
>>
>> 2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar:/root/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar:/root/.m2/repository/commons-collections/commons-collections/3.2.1/commons-
>>
>> collections-3.2.1.jar:/root/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar:/root/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar:/root/.m
>>
>> 2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar:/root/.m2/repository/commons-beanutils/commons-beanutils-core/1.8.0/commons-beanutils-core-1.8.0.jar:/root/.m2/repository/org/apac
>>
>> he/hadoop/hadoop-auth/2.4.0/hadoop-auth-2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.4.0/hadoop-hdfs-2.4.0.jar:/root/.m2/repository/org/mortbay/jetty/jetty-util/6.1.26/jetty-util-6.1.26.jar:/root
>>
>> /.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-app/2.4.0/hadoop-mapreduce-client-app-2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-common/2.4.0/hadoop-mapreduce-client-common-
>>
>> 2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-yarn-client/2.4.0/hadoop-yarn-client-2.4.0.jar:/root/.m2/repository/com/sun/jersey/jersey-client/1.9/jersey-client-1.9.jar:/root/.m2/repository/org/apache/ha
>>
>> doop/hadoop-yarn-server-common/2.4.0/hadoop-yarn-server-common-2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-shuffle/2.4.0/hadoop-mapreduce-client-shuffle-2.4.0.jar:/root/.m2/repository/
>>
>> org/apache/hadoop/hadoop-yarn-api/2.4.0/hadoop-yarn-api-2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-core/2.4.0/hadoop-mapreduce-client-core-2.4.0.jar:/root/.m2/repository/org/apache/ha
>>
>> doop/hadoop-yarn-common/2.4.0/hadoop-yarn-common-2.4.0.jar:/root/.m2/repository/javax/xml/bind/jaxb-api/2.2.2/jaxb-api-2.2.2.jar:/root/.m2/repository/javax/xml/stream/stax-api/1.0-2/stax-api-1.0-2.jar:/root/.m2/re
>>
>> pository/org/apache/hadoop/hadoop-mapreduce-client-jobclient/2.4.0/hadoop-mapreduce-client-jobclient-2.4.0.jar:/root/.m2/repository/org/apache/hadoop/hadoop-annotations/2.4.0/hadoop-annotations-2.4.0.jar:/root/.m2
>>
>> /repository/org/apache/spark/spark-launcher_2.10/1.6.0-SNAPSHOT/spark-launcher_2.10-1.6.0-SNAPSHOT.jar:/root/.m2/repository/org/apache/spark/spark-network-common_2.10/1.6.0-SNAPSHOT/spark-network-common_2.10-1.6.0
>>
>> 

Re: Spark big rdd problem

2015-12-16 Thread Eran Witkon
I ran the yarn logs command and got the following:
A set of YarnAllocator warnings 'expected to find requests, but found none.'
Then an error:
Akka. ErrorMonitor: associationError ...
But then I still get final app status: Succeeded, exit code 0.
What do these errors mean?

On Wed, 16 Dec 2015 at 08:27 Eran Witkon  wrote:

> But what if I don't have more memory?
> On Wed, 16 Dec 2015 at 08:13 Zhan Zhang  wrote:
>
>> There are two cases here. If the container is killed by yarn, you can
>> increase jvm overhead. Otherwise, you have to increase the executor-memory
>> if there is no memory leak happening.
>>
>> Thanks.
>>
>> Zhan Zhang
>>
>> On Dec 15, 2015, at 9:58 PM, Eran Witkon  wrote:
>>
>> If the problem is containers trying to use more memory than they are allowed,
>> how do I limit them? I already have executor-memory 5G.
>> Eran
>> On Tue, 15 Dec 2015 at 23:10 Zhan Zhang  wrote:
>>
>>> You should be able to get the logs from yarn by “yarn logs
>>> -applicationId xxx”, where you can possible find the cause.
>>>
>>> Thanks.
>>>
>>> Zhan Zhang
>>>
>>> On Dec 15, 2015, at 11:50 AM, Eran Witkon  wrote:
>>>
>>> > When running
>>> > val data = sc.wholeTextFiles("someDir/*") data.count()
>>> >
>>> > I get numerous warnings from yarn until I get an Akka association exception.
>>> > Can someone explain what happens when spark loads this rdd and can't
>>> fit it all in memory?
>>> > Based on the exception it looks like the server is disconnecting from
>>> yarn and failing... Any idea why? The code is simple but still failing...
>>> > Eran
>>>
>>>
>>
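
For reference, the two knobs Zhan mentions above correspond to these settings in Spark 1.x on YARN (the values are placeholders to tune, not recommendations):

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "6g")                   // the executor JVM heap
  .set("spark.yarn.executor.memoryOverhead", "1024")    // extra off-heap headroom per executor, in MB

// or equivalently on the command line:
//   spark-submit --executor-memory 6g --conf spark.yarn.executor.memoryOverhead=1024 ...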


Re: Hive error after update from 1.4.1 to 1.5.2

2015-12-16 Thread Ashwin Sai Shankar
Hi Bryan,
I see the same issue with 1.5.2. Can you please let me know what the
resolution was?

Thanks,
Ashwin

On Fri, Nov 20, 2015 at 12:07 PM, Bryan Jeffrey 
wrote:

> Nevermind. I had a library dependency that still had the old Spark version.
>
> On Fri, Nov 20, 2015 at 2:14 PM, Bryan Jeffrey 
> wrote:
>
>> The 1.5.2 Spark was compiled using the following options:  mvn
>> -Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Phive
>> -Phive-thriftserver clean package
>>
>> Regards,
>>
>> Bryan Jeffrey
>>
>> On Fri, Nov 20, 2015 at 2:13 PM, Bryan Jeffrey 
>> wrote:
>>
>>> Hello.
>>>
>>> I'm seeing an error creating a Hive Context moving from Spark 1.4.1 to
>>> 1.5.2.  Has anyone seen this issue?
>>>
>>> I'm invoking the following:
>>>
>>> new HiveContext(sc) // sc is a Spark Context
>>>
>>> I am seeing the following error:
>>>
>>> SLF4J: Class path contains multiple SLF4J bindings.
>>> SLF4J: Found binding in
>>> [jar:file:/spark/spark-1.5.2/assembly/target/scala-2.11/spark-assembly-1.5.2-hadoop2.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: Found binding in
>>> [jar:file:/hadoop-2.6.1/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>>> explanation.
>>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>> Exception in thread "main" java.lang.NoSuchMethodException:
>>> org.apache.hadoop.hive.conf.HiveConf.getTimeVar(org.apache.hadoop.hive.conf.HiveConf$ConfVars,
>>> java.util.concurrent.TimeUnit)
>>> at java.lang.Class.getMethod(Class.java:1786)
>>> at
>>> org.apache.spark.sql.hive.client.Shim.findMethod(HiveShim.scala:114)
>>> at
>>> org.apache.spark.sql.hive.client.Shim_v0_14.getTimeVarMethod$lzycompute(HiveShim.scala:415)
>>> at
>>> org.apache.spark.sql.hive.client.Shim_v0_14.getTimeVarMethod(HiveShim.scala:414)
>>> at
>>> org.apache.spark.sql.hive.client.Shim_v0_14.getMetastoreClientConnectRetryDelayMillis(HiveShim.scala:459)
>>> at
>>> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:198)
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>> Method)
>>> at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>> at
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> at
>>> java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>>> at
>>> org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:183)
>>> at
>>> org.apache.spark.sql.hive.client.IsolatedClientLoader.(IsolatedClientLoader.scala:179)
>>> at
>>> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:226)
>>> at
>>> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
>>> at
>>> org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:392)
>>> at
>>> org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:174)
>>> at
>>> org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:177)
>>> at
>>> Main.Factories.HiveContextSingleton$.createHiveContext(HiveContextSingleton.scala:21)
>>> at
>>> Main.Factories.HiveContextSingleton$.getHiveContext(HiveContextSingleton.scala:14)
>>> at
>>> Main.Factories.SparkStreamingContextFactory$.createSparkContext(SparkStreamingContextFactory.scala:35)
>>> at Main.WriteModel$.main(WriteModel.scala:16)
>>> at Main.WriteModel.main(WriteModel.scala)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:497)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>
>>
>>
>


Preventing an RDD from shuffling

2015-12-16 Thread sparkuser2345
Is there a way to prevent an RDD from shuffling in a join operation without
repartitioning it? 

I'm reading an RDD from sharded MongoDB, joining that with an RDD of
incoming data (+ some additional calculations), and writing the resulting
RDD back to MongoDB. It would make sense to shuffle only the incoming data
RDD so that the joined RDD would already be partitioned correctly according
to the MongoDB shard key.

I know I can prevent an RDD from shuffling in a join operation by
partitioning it beforehand but partitioning would already shuffle the RDD.
In addition, I'm only doing the join once per RDD read from MongoDB. Is
there a way to tell Spark to shuffle only the incoming data RDD?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Preventing-an-RDD-from-shuffling-tp25717.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org