Re: SparkLauncher is blocked until main process is killed.

2015-10-29 Thread Jey Kottalam
Could you please provide the jstack output? That would help the devs
identify the blocking operation more easily.

On Thu, Oct 29, 2015 at 6:54 PM, 陈宇航  wrote:

> I tried to use SparkLauncher (org.apache.spark.launcher.SparkLauncher) to
> submit a Spark Streaming job; however, in my test, the SparkSubmit process
> got stuck in the "addJar" procedure. Only when the main process (the caller
> of SparkLauncher) is killed does the submit procedure continue to run. I ran
> jstack for the process; it seems Jetty was blocking it, and I'm pretty sure
> there were no port conflicts.
>
> The environment is RHEL (Red Hat Enterprise Linux) 6u3 x64, and Spark runs in
> standalone mode.
>
> Did this happen to any of you?
>
>
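
A minimal SparkLauncher driver, sketched below in Scala, shows the API under
discussion (the Spark home, jar path, main class, and master URL are all
placeholders). One common reason for a launched SparkSubmit child to appear
stuck is that the parent never drains the child's stdout/stderr pipes, so the
sketch reads both streams on background threads; whether that is the cause of
the hang reported above would still need the jstack output to confirm.

import java.io.InputStream
import scala.io.Source
import org.apache.spark.launcher.SparkLauncher

object LaunchStreamingJob {
  def main(args: Array[String]): Unit = {
    val process = new SparkLauncher()
      .setSparkHome("/opt/spark")                    // placeholder install path
      .setAppResource("/path/to/streaming-job.jar")  // placeholder application jar
      .setMainClass("com.example.StreamingJob")      // placeholder main class
      .setMaster("spark://master:7077")              // placeholder master URL
      .launch()

    // Drain the child's stdout and stderr on daemon threads; if these pipes
    // fill up unread, the spark-submit child process can block.
    def drain(in: InputStream, tag: String): Unit = {
      val t = new Thread(new Runnable {
        override def run(): Unit =
          Source.fromInputStream(in).getLines().foreach(line => println(s"[$tag] $line"))
      })
      t.setDaemon(true)
      t.start()
    }
    drain(process.getInputStream, "stdout")
    drain(process.getErrorStream, "stderr")

    val exitCode = process.waitFor()
    println("spark-submit exited with code " + exitCode)
  }
}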


Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
> at org.apache.spark.repl.SparkILoop.org
> $apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at org.apache.spark.repl.SparkILoop.org
> $apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
> at java.lang.reflect.Method.invoke(Method.java:611)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
> at java.lang.reflect.Method.invoke(Method.java:611)
> at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
> ... 83 more
>
> Regards,
> Sourav
>
>
> On Mon, Jun 29, 2015 at 6:53 PM, Jey Kottalam  wrote:
>
>> The format is still "com.databricks.spark.csv", but the parameter passed
>> to spark-shell is "--packages com.databricks:spark-csv_2.11:1.1.0".
>>
>> On Mon, Jun 29, 2015 at 2:59 PM, Sourav Mazumder <
>> sourav.mazumde...@gmail.com> wrote:
>>
>>> Hi Jey,
>>>
>>> Not much luck.
>>>
>>> If I use the class com.databricks:spark-csv_2.11:1.1.0 or
>>> com.databricks.spark.csv_2.11.1.1.0, I get a class-not-found error. With
>>> com.databricks.spark.csv I don't get the class-not-found error, but I still
>>> get the previous error even after using file:/// in the URI.
>>>
>>> Regards,
>>> Sourav
>>>
>>> On Mon, Jun 29, 2015 at 1:13 PM, Jey Kottalam 
>>> wrote:
>>>
>>>> Hi Sourav,
>>>>
>>>> The error seems to be caused by the fact that your URL starts with
>>>> "file://" instead of "file:///".
>>>>
>>>> Also, I believe the current version of the package for Spark 1.4 with
>>>> Scala 2.11 should be "com.databricks:spark-csv_2.11:1.1.0".
>>>>
>>>> -Jey
>>>>
>>>> On Mon, Jun 29, 2015 at 12:23 PM, Sourav Mazumder <
>>>> sourav.mazumde...@gmail.com> wrote:
>>>>
>>>>> Hi Jey,
>>>>>
>>>>> Thanks for your inputs.
>>>>>
>>>>> Probably I'm getting this error because I'm trying to read a CSV file
>>>>> from the local file system using the com.databricks.spark.csv package.
>>>>> Perhaps this package has a hard-coded dependency on Hadoop, as it is
>>>>> trying to read the input format from HadoopRDD.
>>>>>
>>>>> Can you please confirm?
>>>>>
>>>>> Here is what I did -
>>>>>
>>>>> Ran the spark-shell as
>>>>>
>>>>> bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3.
>>>>>
>>>>> Then in the shell I ran :
>>>>> val df = 
>>>>> sqlContext.read.format("com.databricks.spark.csv").load("file://home/biadmin/DataScience/PlutoMN.csv")
>>>>>
>>>>>
>>>>>
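
Putting the two fixes above together (the Scala 2.11 package coordinates and a
triple-slash file: URI), a corrected session would look roughly like the sketch
below; the CSV path is the one from the original post, and the header option is
an assumption about the file:

bin/spark-shell --packages com.databricks:spark-csv_2.11:1.1.0

// inside the shell -- note the triple slash after "file:"
// (use the _2.10 coordinates instead if your Spark build uses Scala 2.10)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("file:///home/biadmin/DataScience/PlutoMN.csv")
df.show(5)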

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
The format is still "com.databricks.spark.csv", but the parameter passed to
spark-shell is "--packages com.databricks:spark-csv_2.11:1.1.0".

On Mon, Jun 29, 2015 at 2:59 PM, Sourav Mazumder <
sourav.mazumde...@gmail.com> wrote:

> Hi Jey,
>
> Not much luck.
>
> If I use the class com.databricks:spark-csv_2.11:1.1.0 or
> com.databricks.spark.csv_2.11.1.1.0, I get a class-not-found error. With
> com.databricks.spark.csv I don't get the class-not-found error, but I still
> get the previous error even after using file:/// in the URI.
>
> Regards,
> Sourav
>
> On Mon, Jun 29, 2015 at 1:13 PM, Jey Kottalam  wrote:
>
>> Hi Sourav,
>>
>> The error seems to be caused by the fact that your URL starts with
>> "file://" instead of "file:///".
>>
>> Also, I believe the current version of the package for Spark 1.4 with
>> Scala 2.11 should be "com.databricks:spark-csv_2.11:1.1.0".
>>
>> -Jey
>>
>> On Mon, Jun 29, 2015 at 12:23 PM, Sourav Mazumder <
>> sourav.mazumde...@gmail.com> wrote:
>>
>>> Hi Jey,
>>>
>>> Thanks for your inputs.
>>>
>>> Probably I'm getting this error because I'm trying to read a CSV file from
>>> the local file system using the com.databricks.spark.csv package. Perhaps
>>> this package has a hard-coded dependency on Hadoop, as it is trying to read
>>> the input format from HadoopRDD.
>>>
>>> Can you please confirm?
>>>
>>> Here is what I did -
>>>
>>> Ran the spark-shell as
>>>
>>> bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3.
>>>
>>> Then in the shell I ran :
>>> val df = 
>>> sqlContext.read.format("com.databricks.spark.csv").load("file://home/biadmin/DataScience/PlutoMN.csv")
>>>
>>>
>>>
>>> Regards,
>>> Sourav
>>>
>>> 15/06/29 15:14:59 INFO spark.SparkContext: Created broadcast 0 from
>>> textFile at CsvRelation.scala:114
>>> java.lang.RuntimeException: Error in configuring object
>>> at
>>> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
>>> at
>>> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
>>> at
>>> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>>> at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
>>> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
>>> at
>>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
>>> at
>>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
>>> at scala.Option.getOrElse(Option.scala:120)
>>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
>>> at
>>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
>>> at
>>> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
>>> at scala.Option.getOrElse(Option.scala:120)
>>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
>>> at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1251)
>>> at
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
>>> at
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
>>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
>>> at org.apache.spark.rdd.RDD.take(RDD.scala:1246)
>>> at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1286)
>>> at
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
>>> at
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
>>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
>>> at org.apache.spark.rdd.RDD.first(RDD.scala:1285)
>>> at
>>> com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:114)
>>> at
>>> com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:112)
>>> at
>>> com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:95)
>>> at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:53)
>>> at
>>> com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:8

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
> at java.lang.reflect.Method.invoke(Method.java:611)
> at
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
> at
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
> at org.apache.spark.repl.SparkILoop.org
> $apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at org.apache.spark.repl.SparkILoop.org
> $apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
> at java.lang.reflect.Method.invoke(Method.java:611)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
> at java.lang.reflect.Method.invoke(Method.java:611)
> at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
> ... 83 more
>
>
>
> On Mon, Jun 29, 2015 at 10:02 AM, Jey Kottalam 
> wrote:
>
>> Actually, Hadoop InputFormats can still be used to read and write from
>> "file://", "s3n://", and similar schemes. You just won't be able to
>> read/write to HDFS without installing Hadoop and setting up an HDFS cluster.
>>
>> To summarize: Sourav, you can use any of the prebuilt packages (i.e.
>> anything other than "source code").
>>
>> Hope that helps,
>> -Jey
>>
>> On Mon, Jun 29, 2015 at 7:33 AM, ayan guha  wrote:
>>
>>> Hi
>>>
>>> You really do not need a Hadoop installation. You can download a pre-built
>>> version with any Hadoop and unzip it, and you are good to go. Yes, it may
>>> complain while launching the master and workers; safely ignore that. The only
>>> problem is while writing to a directory. Of course, you will not be able to
>>> use any Hadoop InputFormat etc. out of the box.
>>>
>>> ** I am assuming it's a learning question :) For production, I would
>>> suggest building it from source.
>>>
>>> If you are using Python and need some help, please drop me a note
>>> offline.
>>>
>>> Best
>>> Ayan
>>>
>>> On Tue, Jun 30, 2015 at 12:24 AM, Sourav Mazumder <
>>> sourav.mazumde...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to run Spark without Hadoop, where the data would be read from
>>>> and written to the local disk.
>>>>
>>>> For this I have a few questions:
>>>>
>>>> 1. Which download do I need to use? Among the download options I don't see
>>>> any binary download that does not need Hadoop. Is the only way to do this
>>>> to download the source code version and compile it myself?
>>>>
>>>> 2. Which installation/quick-start guideline should I use for this? So far
>>>> I haven't seen any documentation that specifically addresses installing or
>>>> setting up Spark without Hadoop, unless I'm missing one.
>>>>
>>>> Regards,
>>>> Sourav
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>


Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
Actually, Hadoop InputFormats can still be used to read and write from
"file://", "s3n://", and similar schemes. You just won't be able to
read/write to HDFS without installing Hadoop and setting up an HDFS cluster.

To summarize: Sourav, you can use any of the prebuilt packages (i.e.
anything other than "source code").

Hope that helps,
-Jey

On Mon, Jun 29, 2015 at 7:33 AM, ayan guha  wrote:

> Hi
>
> You really do not need a Hadoop installation. You can download a pre-built
> version with any Hadoop and unzip it, and you are good to go. Yes, it may
> complain while launching the master and workers; safely ignore that. The only
> problem is while writing to a directory. Of course, you will not be able to
> use any Hadoop InputFormat etc. out of the box.
>
> ** I am assuming it's a learning question :) For production, I would
> suggest building it from source.
>
> If you are using Python and need some help, please drop me a note offline.
>
> Best
> Ayan
>
> On Tue, Jun 30, 2015 at 12:24 AM, Sourav Mazumder <
> sourav.mazumde...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm trying to run Spark without Hadoop, where the data would be read from
>> and written to the local disk.
>>
>> For this I have a few questions:
>>
>> 1. Which download do I need to use? Among the download options I don't see
>> any binary download that does not need Hadoop. Is the only way to do this to
>> download the source code version and compile it myself?
>>
>> 2. Which installation/quick-start guideline should I use for this? So far I
>> haven't seen any documentation that specifically addresses installing or
>> setting up Spark without Hadoop, unless I'm missing one.
>>
>> Regards,
>> Sourav
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
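
As a small illustration of the point above, the following self-contained Scala
job reads from and writes to the local filesystem only, using a prebuilt Spark
package and no HDFS; the paths are made up:

import org.apache.spark.{SparkConf, SparkContext}

object LocalOnlyWordCount {
  def main(args: Array[String]): Unit = {
    // "local[*]" keeps everything on a single machine.
    val conf = new SparkConf().setAppName("local-only").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines  = sc.textFile("file:///tmp/input.txt")
    val counts = lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("file:///tmp/word-counts")

    sc.stop()
  }
}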


Re: Get importerror when i run pyspark with ipython=1

2015-02-26 Thread Jey Kottalam
Hi Sourabh, could you try it with the stable 2.4 version of IPython?

On Thu, Feb 26, 2015 at 8:54 PM, sourabhguha  wrote:
> 
>
> I get the above error when I try to run pyspark with the ipython option. I
> do not get this error when I run it without the ipython option.
>
> I have Java 8, Scala 2.10.4 and Enthought Canopy Python on my box. OS Win
> 8.1 Desktop.
>
> Thanks,
> Sourabh
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Get-importerror-when-i-run-pyspark-with-ipython-1-tp21843.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: reduceByKey vs countByKey

2015-02-24 Thread Jey Kottalam
Hi Sathish,

The current implementation of countByKey uses reduceByKey:
https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L332

It seems that countByKey is mostly deprecated:
https://issues.apache.org/jira/browse/SPARK-3994

-Jey

On Tue, Feb 24, 2015 at 3:53 PM, Sathish Kumaran Vairavelu
 wrote:
> Hello,
>
> Quick question. I am trying to understand the difference between reduceByKey
> and countByKey. Which one gives better performance, reduceByKey or countByKey?
> If we can perform the same count operation using reduceByKey, why do we need
> countByKey/countByValue?
>
> Thanks
>
> Sathish
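
A short illustration of the difference, with made-up data (run in spark-shell
so that sc is available): reduceByKey is a transformation that keeps the
per-key counts distributed, while countByKey is an action that collects a Map
to the driver, so it is only appropriate when the number of distinct keys is
small.

val pairs = sc.parallelize(Seq(("a", 10), ("b", 20), ("a", 30)))

// stays on the cluster as an RDD[(String, Int)]
val distributedCounts = pairs.mapValues(_ => 1).reduceByKey(_ + _)

// collected to the driver as a Map[String, Long]
val driverCounts = pairs.countByKey()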

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Extracting values from a Collection

2014-11-21 Thread Jey Kottalam
Hi Sanjay,

These are instances of the standard Scala collection type "Set", and its
documentation can be found by googling the phrase "scala set".

Hope that helps,
-Jey

On Fri, Nov 21, 2014 at 10:41 AM, Sanjay Subramanian
 wrote:
> hey guys
>
> names.txt
> =
> 1,paul
> 2,john
> 3,george
> 4,ringo
>
>
> songs.txt
> =
> 1,Yesterday
> 2,Julia
> 3,While My Guitar Gently Weeps
> 4,With a Little Help From My Friends
> 1,Michelle
> 2,Nowhere Man
> 3,Norwegian Wood
> 4,Octopus's Garden
>
> What I want to do is real simple
>
> Desired Output
> ==
> (4,(With a Little Help From My Friends, Octopus's Garden))
> (2,(Julia, Nowhere Man))
> (3,(While My Guitar Gently Weeps, Norwegian Wood))
> (1,(Yesterday, Michelle))
>
>
> My Code
> ===
> val file1Rdd =
> sc.textFile("/Users/sansub01/mycode/data/songs/names.txt").map(x =>
> (x.split(",")(0), x.split(",")(1)))
> val file2Rdd =
> sc.textFile("/Users/sansub01/mycode/data/songs/songs.txt").map(x =>
> (x.split(",")(0), x.split(",")(1)))
> val file2RddGrp = file2Rdd.groupByKey()
> file2Rdd.groupByKey().mapValues(names =>
> names.toSet).collect().foreach(println)
>
> Result
> ===
> (4,Set(With a Little Help From My Friends, Octopus's Garden))
> (2,Set(Julia, Nowhere Man))
> (3,Set(While My Guitar Gently Weeps, Norwegian Wood))
> (1,Set(Yesterday, Michelle))
>
>
> How can I extract values from the Set ?
>
> Thanks
>
> sanjay
>
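
Since the grouped values are plain Scala Sets, ordinary collection operations
apply. Two possibilities, sketched against the same file2Rdd as in the post
above:

val grouped = file2Rdd.groupByKey().mapValues(_.toSet)

// Option 1: flatten back into one (key, song) pair per element
grouped.flatMap { case (k, songs) => songs.map(song => (k, song)) }
  .collect()
  .foreach(println)

// Option 2: render each Set as a single comma-separated string,
// e.g. (1,Yesterday, Michelle)
grouped.mapValues(_.mkString(", "))
  .collect()
  .foreach(println)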


Re: Is it possible to use Parquet with Dremel encoding

2014-09-25 Thread Jey Kottalam
Hi Matthes,

You may find the following blog post relevant:
http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/

Hope that helps,
-Jey

On Thu, Sep 25, 2014 at 5:05 PM, matthes  wrote:
> Hi again!
>
> At the moment I am trying to use Parquet, and I want to keep the data in
> memory in an efficient way so that requests against the data are as fast as
> possible.
> I read that Parquet is able to encode nested columns. Parquet uses the
> Dremel encoding with definition and repetition levels.
> Is it currently possible to use this in Spark as well, or is it not
> implemented yet? If yes, I'm not sure how to do it. I saw some examples that
> try to put arrays or case classes inside other case classes, but I don't
> think that is the right way. The other thing I saw in this context was
> SchemaRDDs.
>
> Input:
>
> Col1|   Col2|   Col3|   Col4
> Int |   long|   long|   int
> -
> 14  |   1234|   1422|   3
> 14  |   3212|   1542|   2
> 14  |   8910|   1422|   8
> 15  |   1234|   1542|   9
> 15  |   8897|   1422|   13
>
> Want this Parquet-format:
> Col3|   Col1|   Col4|   Col2
> long|   int |   int |   long
> 
> 1422|   14  |   3   |   1234
> “   |   “   |   8   |   8910
> “   |   15  |   13  |   8897
> 1542|   14  |   2   |   3212
> “   |   15  |   9   |   1234
>
> It would be awesome if somebody could give me a good hint on how to do that,
> or maybe a better way.
>
> Best,
> Matthes
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-use-Parquet-with-Dremel-encoding-tp15186.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
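
For what it's worth, a rough sketch of writing nested records to Parquet from
Spark SQL's 1.1-era SchemaRDD API is shown below. The case class names and the
output path are made up, and it assumes your Spark version already supports
nested types in Parquet; the definition/repetition levels themselves are
produced by the Parquet writer, not by user code.

import org.apache.spark.sql.SQLContext

case class Song(col1: Int, col2: Long, col4: Int)
case class Group(col3: Long, rows: Seq[Song])

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit RDD[Product] -> SchemaRDD

val raw = sc.parallelize(Seq(
  (14, 1234L, 1422L, 3), (14, 3212L, 1542L, 2), (15, 1234L, 1542L, 9)))

// Nest the rows under col3, mirroring the desired layout in the post above.
val nested = raw.groupBy(_._3).map { case (c3, rows) =>
  Group(c3, rows.map(r => Song(r._1, r._2, r._4)).toSeq)
}

nested.saveAsParquetFile("/tmp/nested.parquet")
val readBack = sqlContext.parquetFile("/tmp/nested.parquet")
readBack.registerTempTable("grouped")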

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: RDD pipe example. Is this a bug or a feature?

2014-09-19 Thread Jey Kottalam
Your proposed use of rdd.pipe("foo") to communicate with an external
process seems fine. The "foo" program should read its input from
stdin, perform its computations, and write its results back to stdout.
Note that "foo" will be run on the workers, invoked once per
partition, and the result will be an RDD[String] containing an entry
for each line of output from your program.

-Jey

On Fri, Sep 19, 2014 at 3:59 PM, Andy Davidson
 wrote:
> Hi Jey
>
> Many thanks for the code example. Here is what I really want to do. I want
> to use Spark Streaming and Python. Unfortunately, pySpark does not support
> streaming yet. It was suggested that the way to work around this was to use
> an RDD pipe. The example below was a little experiment.
>
> You can think of my system as following the standard unix shell script pipe
> design
>
> Stream of data -> Spark -> downstream system not implemented in Spark
>
> After seeing your example code I now understand how the stdin and stdout get
> configured.
>
> It seems like pipe() does not work the way I want. I guess I could open a
> socket and write to the downstream process.
>
> Any suggestions would be greatly appreciated
>
> Thanks Andy
>
> From: Jey Kottalam 
> Reply-To: 
> Date: Friday, September 19, 2014 at 12:35 PM
> To: Andrew Davidson 
> Cc: "user@spark.apache.org" 
> Subject: Re: RDD pipe example. Is this a bug or a feature?
>
> Hi Andy,
>
> That's a feature -- you'll have to print out the return value from
> collect() if you want the contents to show up on stdout.
>
> Probably something like this:
>
> for(Iterator iter = rdd.pipe(pwd +
> "/src/main/bin/RDDPipe.sh").collect().iterator(); iter.hasNext();)
>System.out.println(iter.next());
>
>
> Hope that helps,
> -Jey
>
> On Fri, Sep 19, 2014 at 11:21 AM, Andy Davidson
>  wrote:
>
> Hi
>
> I wrote a little Java job to try to figure out how RDD pipe works.
> Below is my test shell script. If I turn on debugging in the script, I get
> output in my console. If debugging is turned off in the shell script, I do
> not see anything in my console. Is this a bug or a feature?
>
> I am running the job locally on a Mac
>
> Thanks
>
> Andy
>
>
> Here is my Java
>
>  rdd.pipe(pwd + "/src/main/bin/RDDPipe.sh").collect();
>
>
>
> #!/bin/sh
>
>
> #
>
> # Use this shell script to figure out how spark RDD pipe() works
>
> #
>
>
> set -x # turns shell debugging on
>
> #set +x # turns shell debugging off
>
>
> while read x ;
>
> do
>
> echo RDDPipe.sh $x ;
>
> done
>
>
>
> Here is the output if debugging is turned on
>
> $ !grep
>
> grep RDDPipe run.sh.out
>
> + echo RDDPipe.sh 0
>
> + echo RDDPipe.sh 0
>
> + echo RDDPipe.sh 2
>
> + echo RDDPipe.sh 0
>
> + echo RDDPipe.sh 3
>
> + echo RDDPipe.sh 0
>
> + echo RDDPipe.sh 0
>
> $
>
>
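
A compact Scala sketch of the pipe() semantics described above (tr is just an
arbitrary external program assumed to be on the workers' PATH):

// Each worker launches one copy of the command per partition, feeds the
// partition's elements to its stdin one per line, and the lines the command
// writes to stdout become the resulting RDD[String].
val input = sc.parallelize(Seq("hello", "world"), 2)
val upper = input.pipe("tr a-z A-Z")

// Nothing appears on the driver until the result is collected:
upper.collect().foreach(println)   // prints HELLO and WORLD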

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: RDD pipe example. Is this a bug or a feature?

2014-09-19 Thread Jey Kottalam
Hi Andy,

That's a feature -- you'll have to print out the return value from
collect() if you want the contents to show up on stdout.

Probably something like this:

for(Iterator iter = rdd.pipe(pwd +
"/src/main/bin/RDDPipe.sh").collect().iterator(); iter.hasNext();)
   System.out.println(iter.next());


Hope that helps,
-Jey

On Fri, Sep 19, 2014 at 11:21 AM, Andy Davidson
 wrote:
> Hi
>
> I wrote a little Java job to try to figure out how RDD pipe works.
> Below is my test shell script. If I turn on debugging in the script, I get
> output in my console. If debugging is turned off in the shell script, I do
> not see anything in my console. Is this a bug or a feature?
>
> I am running the job locally on a Mac
>
> Thanks
>
> Andy
>
>
> Here is my Java
>
> rdd.pipe(pwd + "/src/main/bin/RDDPipe.sh").collect();
>
>
>
> #!/bin/sh
>
>
> #
>
> # Use this shell script to figure out how spark RDD pipe() works
>
> #
>
>
> set -x # turns shell debugging on
>
> #set +x # turns shell debugging off
>
>
> while read x ;
>
> do
>
> echo RDDPipe.sh $x ;
>
> done
>
>
>
> Here is the output if debugging is turned on
>
> $ !grep
>
> grep RDDPipe run.sh.out
>
> + echo RDDPipe.sh 0
>
> + echo RDDPipe.sh 0
>
> + echo RDDPipe.sh 2
>
> + echo RDDPipe.sh 0
>
> + echo RDDPipe.sh 3
>
> + echo RDDPipe.sh 0
>
> + echo RDDPipe.sh 0
>
> $

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: MLlib - Possible to use SVM with Radial Basis Function kernel rather than Linear Kernel?

2014-09-18 Thread Jey Kottalam
Hi Aris,

A simple approach to gaining some of the benefits of an RBF kernel is
to add synthetic features to your training set. For example, if your
original data consists of 3-dimensional vectors [x, y, z], you could
compute a new 9-dimensional feature vector containing [x, y, z, x^2,
y^2, z^2, x*y, x*z, y*z].

This basic idea can be taken much further:
  1. http://www.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf
  2. http://arxiv.org/pdf/1109.4603.pdf

Hope that helps,
-Jey

On Thu, Sep 18, 2014 at 11:10 AM, Aris  wrote:
> Sorry to bother you guys, but does anybody have any ideas about the status
> of MLlib with a Radial Basis Function kernel for SVM?
>
> Thank you!
>
> On Tue, Sep 16, 2014 at 3:27 PM, Aris  wrote:
>
>> Hello Spark Community -
>>
>> I am using the support vector machine / SVM implementation in MLlib with
>> the standard linear kernel; however, I noticed that the Spark documentation
>> for StandardScaler *specifically* mentions that SVMs which use the RBF
>> kernel work really well when you have standardized data...
>>
>> which begs the question: is there some kind of support for RBF kernels
>> rather than linear kernels? In small-data tests using R, the RBF kernel
>> worked really well and the linear kernel never converged... so I would really
>> like to use RBF.
>>
>> Thank you folks for any help!
>>
>> Aris
>
>
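
A hedged sketch of the synthetic-feature idea from the reply above, expanding
each 3-dimensional LabeledPoint into the 9 monomial features before training
the linear SVM (names are illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def expand(p: LabeledPoint): LabeledPoint = {
  val Array(x, y, z) = p.features.toArray
  LabeledPoint(p.label,
    Vectors.dense(x, y, z, x * x, y * y, z * z, x * y, x * z, y * z))
}

def withSyntheticFeatures(data: RDD[LabeledPoint]): RDD[LabeledPoint] =
  data.map(expand)

// The expanded RDD is then fed to the linear SVM as usual, e.g.
// SVMWithSGD.train(withSyntheticFeatures(trainingData), numIterations).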

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: EC2 instances missing SSD drives randomly?

2014-08-19 Thread Jey Kottalam
I think you have to explicitly list the ephemeral disks in the device
map when launching the EC2 instance.

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/block-device-mapping-concepts.html

On Tue, Aug 19, 2014 at 11:54 AM, Andras Barjak
 wrote:
> Hi,
>
> Using the Spark 1.0.1 EC2 script I launched 35 m3.2xlarge instances. (I was
> using the Singapore region.) Some of the instances we got came without the
> ephemeral internal (non-EBS) SSD devices that are supposed to be attached to
> them. Some of them have these drives but not all, and there is no sign from
> the outside; one can only notice this by ssh-ing into the instances and typing
> `df -l`, so it seems to be a bug to me.
> I am not sure whether Amazon is not providing the drives or the Spark AMI
> configures something wrong. Do you have any idea what is going on? I never
> faced this issue before. It is not that the drives are not formatted/mounted
> (as was the case with the new r3 instances); they are not present
> physically. (Though /mnt and /mnt2 are configured properly in fstab.)
>
> I tried several times and the result was the same: some of the instances
> launched with the drives, some without.
>
> Please let me know if you have any ideas about this inconsistent
> behaviour.
>
> András

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Anaconda Spark AMI

2014-07-03 Thread Jey Kottalam
Hi Ben,

Has the PYSPARK_PYTHON environment variable been set in
spark/conf/spark-env.sh to the path of the new python binary?

FYI, there's a /root/copy-dirs script that can be handy when updating
files on an already-running cluster. You'll want to restart the spark
cluster for the changes to take effect, as described at
https://spark.apache.org/docs/latest/ec2-scripts.html

Hope that helps,
-Jey

On Thu, Jul 3, 2014 at 11:54 AM, Benjamin Zaitlen  wrote:
> Hi All,
>
> I'm a dev at Continuum and we are developing a fair amount of tooling around
> Spark.  A few days ago someone expressed interest in numpy+pyspark and
> Anaconda came up as a reasonable solution.
>
> I spent a number of hours yesterday trying to rework the base Spark AMI on
> EC2 but sadly was defeated by a number of errors.
>
> Aggregations seemed to choke -- whereas small takes executed as expected
> (errors are linked to the gist):
>
> >>> sc.appName
> u'PySparkShell'
> >>> sc._conf.getAll()
> [(u'spark.executor.extraLibraryPath', u'/root/ephemeral-hdfs/lib/native/'),
> (u'spark.executor.memory', u'6154m'), (u'spark.submit.pyFiles', u''),
> (u'spark.app.name', u'
> PySparkShell'), (u'spark.executor.extraClassPath',
> u'/root/ephemeral-hdfs/conf'), (u'spark.master',
> u'spark://.compute-1.amazonaws.com:7077')]
> >>> file = sc.textFile("hdfs:///user/root/chekhov.txt")
> >>> file.take(2)
> [u"Project Gutenberg's Plays by Chekhov, Second Series, by Anton Chekhov",
> u'']
>
> >>> lines = file.filter(lambda x: len(x) > 0)
> >>> lines.count()
> VARIOUS ERROS DISCUSSED BELOW
>
> My first thought was that I could simply get away with including anaconda on
> the base AMI, point the path at /dir/anaconda/bin, and bake a new one.
> Doing so resulted in some strange py4j errors like the following:
>
> Py4JError: An error occurred while calling o17.partitions. Trace:
> py4j.Py4JException: Method partitions([]) does not exist
>
> At some point I also saw:
> SystemError: Objects/cellobject.c:24: bad argument to internal function
>
> which is really strange, possibly the result of a version mismatch?
>
> I had another thought of building spark from master on the AMI, leaving the
> spark directory in place, and removing the spark call from the modules list
> in spark-ec2 launch script. Unfortunately, this resulted in the following
> errors:
>
> https://gist.github.com/quasiben/da0f4778fbc87d02c088
>
> If a spark dev was willing to make some time in the near future, I'm sure
> she/he and I could sort out these issues and give the Spark community a
> python distro ready to go for numerical computing.  For instance, I'm not
> sure how pyspark calls out to launch a python session on a slave. Is
> this done as root or as the hadoop user? (I believe I changed /etc/bashrc to
> point to my anaconda bin directory, so it shouldn't really matter.) Is there
> something special about the py4j zip included in the spark dir compared with
> the py4j in pypi?
>
> Thoughts?
>
> --Ben
>
>


Re: Executors not utilized properly.

2014-06-17 Thread Jey Kottalam
Hi Abhishek,

> Where MapReduce is taking 2 minutes, Spark is taking 5 minutes to complete
> the job.

Interesting. Could you tell us more about your program? A "code skeleton"
would certainly be helpful.

Thanks!

-Jey


On Tue, Jun 17, 2014 at 3:21 PM, abhiguruvayya 
wrote:

> I did try creating more partitions by overriding the default number of
> partitions determined by HDFS splits. The problem is that in this case the
> program will run forever. I have the same set of inputs for MapReduce and
> Spark. Where MapReduce is taking 2 minutes, Spark is taking 5 minutes to
> complete the job. I thought that because all of the executors are not being
> utilized properly, my Spark program is running slower than MapReduce. I can
> provide you my code skeleton for your reference. Please help me with this.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744p7759.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
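
Two common ways to raise the number of partitions (and hence concurrent tasks)
are sketched below; the path and the numbers are placeholders and assume the
input is a text file:

val minPartitions = 64
val data = sc.textFile("hdfs:///path/to/input", minPartitions)

// ...or widen an existing RDD just before an expensive stage:
val totalCores = 32                      // assumption: cores across all executors
val widened = data.repartition(2 * totalCores)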


Re: Local file being referenced in mapper function

2014-05-30 Thread Jey Kottalam
Hi Rahul,

Marcelo's explanation is correct. Here's a possible approach to your
program, in pseudo-Python:


# connect to Spark cluster
sc = SparkContext(...)

# load input data
input_data = load_xls(file("input.xls"))
input_rows = input_data['Sheet1'].rows

# create RDD on cluster
input_rdd = sc.parallelize(input_rows)

# munge RDD
result_rdd = input_rdd.map(munge_row)

# collect result RDD to local process
result_rows = result_rdd.collect()

# write output file
write_xls(file("output.xls", "w"), result_rows)



Hope that helps,
-Jey

On Fri, May 30, 2014 at 9:44 AM, Marcelo Vanzin  wrote:
> Hello there,
>
> On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin  wrote:
>> workbook = xlsxwriter.Workbook('output_excel.xlsx')
>> worksheet = workbook.add_worksheet()
>>
>> data = sc.textFile("xyz.txt")
>> # xyz.txt is a file whose each line contains string delimited by 
>>
>> row=0
>>
>> def mapperFunc(x):
>> for i in range(0,4):
>> worksheet.write(row, i , x.split(" ")[i])
>> row++
>> return len(x.split())
>>
>> data2 = data.map(mapperFunc)
>
>> Is using row in 'mapperFunc' like this is a correct way? Will it
>> increment row each time?
>
> No. "mapperFunc" will be executed somewhere else, not in the same
> process running this script. I'm not familiar with how serializing
> closures works in Spark/Python, but you'll most certainly be updating
> the local copy of "row" in the executor, and your driver's copy will
> remain at "0".
>
> In general, in a distributed execution environment like Spark you want
> to avoid as much as possible using state. "row" in your code is state,
> so to do what you want you'd have to use other means (like Spark's
> accumulators). But those are generally expensive in a distributed
> system, and to be avoided if possible.
>
>> Is writing in the excel file using worksheet.write() in side the
>> mapper function a correct way?
>
> No, for the same reasons. Your executor will have a copy of your
> "workbook" variable. So the write() will happen locally to the
> executor, and after the mapperFunc() returns, that will be discarded -
> so your driver won't see anything.
>
> As a rule of thumb, your closures should try to use only their
> arguments as input, or at most use local variables as read-only, and
> only produce output in the form of return values. There are cases
> where you might want to break these rules, of course, but in general
> that's the mindset you should be in.
>
> Also note that you're not actually executing anything here.
> "data.map()" is a transformation, so you're just building the
> execution graph for the computation. You need to execute an action
> (like collect() or take()) if you want the computation to actually
> occur.
>
> --
> Marcelo


Re: help

2014-04-25 Thread Jey Kottalam
Sorry, but I don't know where Cloudera puts the executor log files.
Maybe their docs give the correct path?

On Fri, Apr 25, 2014 at 12:32 PM, Joe L  wrote:
> Hi, thank you for your reply, but I could not find it. It says that there is
> no such file or directory.
>
>
> 
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/help-tp4841p4848.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: help

2014-04-25 Thread Jey Kottalam
Try taking a look at the stderr logs of the executor
"app-20140426030946-0004/8". This should be in the $SPARK_HOME/work
directory of the corresponding machine.

Hope that helps,
-Jey

On Fri, Apr 25, 2014 at 11:17 AM, Joe L  wrote:
> I need someone's help, please. I am getting the following error.
>
> [error] 14/04/26 03:09:47 INFO cluster.SparkDeploySchedulerBackend: Executor
> app-20140426030946-0004/8 removed: class java.io.IOException: Cannot run
> program "/home/exobrain/install/spark-0.9.1/bin/compute-classpath.sh" (in
> directory "."): error=13
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/help-tp4841.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.