RE: Escape commas in file names

2014-12-25 Thread Cheng, Hao
I’ve created a JIRA issue for this: https://issues.apache.org/jira/browse/SPARK-4967. I guess we originally wanted to support scanning multiple Parquet file paths, and internally those paths are held in a single comma-separated string; however, I didn’t find any public example that says we support

Re: Debian package for spark?

2014-12-25 Thread varun sharma
Hi Kevin, Were you able to build Spark with the commands export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" and mvn -Pdeb -DskipTests clean package? I am getting the error below for all versions of Spark (even 1.2.0): Failed to execute goal org.vafer:jdeb:0.11:jdeb (default) on

Do you know any Spark modeling tool?

2014-12-25 Thread Haopu Wang
Hi, I think a modeling tool may be helpful because sometimes it's hard/tricky to program Spark. I don't know if there is already such a tool. Thanks!

Corrupted Exception while deserialize task

2014-12-25 Thread WangTaoTheTonic
Hi Guys, I found an exception while running an application on the 1.2.0-snapshot version. It looks like this: 2014-12-23 07:45:36,333 | ERROR | [Executor task launch worker-0] | Exception in task 0.0 in stage 0.0 (TID 0) | org.apache.spark.Logging$class.logError(Logging.scala:96)

Re: Discourse: A proposed alternative to the Spark User list

2014-12-25 Thread andy petrella
Nice idea, although it needs a plan for hosting, or Spark would have to host it, if I'm not wrong. I've been using Slack for discussions; it's not exactly the same as Discourse, the ML or SO, but it offers interesting features. It's more in the mood of IRC integrated with external services. my2c On Wed

serialization issue with mapPartitions

2014-12-25 Thread ey-chih chow
Hi, I got some issues with mapPartitions with the following piece of code: val sessions = sc .newAPIHadoopFile( ... path to an avro file ..., classOf[org.apache.avro.mapreduce.AvroKeyInputFormat[ByteBuffer]], classOf[AvroKey[ByteBuffer]],

action progress in ipython notebook?

2014-12-25 Thread Eric Friedman
Spark 1.2.0 is SO much more usable than previous releases -- many thanks to the team for this release. A question about progress of actions. I can see how things are progressing using the Spark UI. I can also see the nice ASCII art animation on the spark driver console. Has anyone come up with

RE: Not Serializable exception when integrating SQL and Spark Streaming

2014-12-25 Thread Tarun Garg
Thanks. I marked the variable as transient and moved ahead; now I am getting an exception when executing the query. final static transient SparkConf sparkConf = new SparkConf().setAppName(NumberCount); final static transient JavaSparkContext jc = new JavaSparkContext(sparkConf); static

ReliableDeliverySupervisor: Association with remote system

2014-12-25 Thread SamyaMaiti
Hi All, I am new to both Scala and Spark, so please expect some mistakes. Setup: Scala: 2.10.2, Spark: Apache 1.1.0, Hadoop: Apache 2.4. Intent of the code: to read from a Kafka topic and do some processing. Below are the code details and the error I am getting: import org.apache.spark._ import

Re: ReliableDeliverySupervisor: Association with remote system

2014-12-25 Thread SamyaMaiti
Sorry for the typo. Apache Hadoop version is 2.6.0. Regards, Sam

unable to do group by with 1st column

2014-12-25 Thread Amit Behera
Hi Users, I am reading a CSV file and my data looks like this: key1,value1 key1,value2 key1,value1 key1,value3 key2,value1 key2,value5 key2,value5 key2,value4 key1,value4 key1,value4 key3,value1 key3,value1 key3,value2 Required output: key1:[value1,value2,value1,value3,value4,value4]

Re: Long-running job cleanup

2014-12-25 Thread Ilya Ganelin
Hello all - can anyone please offer any advice on this issue? -Ilya Ganelin On Mon, Dec 22, 2014 at 5:36 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Hi all, I have a long running job iterating over a huge dataset. Parts of this operation are cached. Since the job runs for so long,

Re: How to build Spark against the latest

2014-12-25 Thread guxiaobo1982
The following command works: ./make-distribution.sh --tgz -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests

Re: unable to do group by with 1st column

2014-12-25 Thread Tobias Pfeiffer
Hi, On Fri, Dec 26, 2014 at 5:22 AM, Amit Behera amit.bd...@gmail.com wrote: "How can I do it? Please help me to do." Have you considered using groupByKey? http://spark.apache.org/docs/latest/programming-guide.html#transformations Tobias
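
For illustration only, here is a plain-Python sketch (not Spark code) of the grouping that a groupByKey-style operation would produce for the sample rows from the original question. The helper name group_by_first_column is hypothetical. In Spark the same shape would presumably come from mapping each line to a (key, value) pair and then calling groupByKey; note that a distributed grouping may not preserve the input order of values shown here.

```python
from collections import defaultdict


def group_by_first_column(lines):
    """Collect the second CSV field of each line under its first field."""
    groups = defaultdict(list)
    for line in lines:
        # Split on the first comma only, so values may contain commas.
        key, value = line.split(",", 1)
        groups[key].append(value)
    return dict(groups)


# Sample rows taken from the original question.
sample = [
    "key1,value1", "key1,value2", "key1,value1", "key1,value3",
    "key2,value1", "key2,value5", "key2,value5", "key2,value4",
    "key1,value4", "key1,value4",
    "key3,value1", "key3,value1", "key3,value2",
]

grouped = group_by_first_column(sample)
for key, values in grouped.items():
    print(f"{key}:[{','.join(values)}]")
```

Duplicates are kept on purpose, because the required output in the question lists repeated values.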

Re: serialization issue with mapPartitions

2014-12-25 Thread Tobias Pfeiffer
Hi, On Fri, Dec 26, 2014 at 1:32 AM, ey-chih chow eyc...@hotmail.com wrote: I got some issues with mapPartitions with the following piece of code: val sessions = sc .newAPIHadoopFile( ... path to an avro file ...,

Re: serialization issue with mapPartitions

2014-12-25 Thread ey-chih chow
I should rephrase my question as follows: How to use the corresponding Hadoop Configuration of a HadoopRDD in defining a function as an input parameter to the mapPartitions function? Thanks. Ey-Chih Chow

Re: Discourse: A proposed alternative to the Spark User list

2014-12-25 Thread Tobias Pfeiffer
Nick, uh, I would have expected a rather heated discussion, but the opposite seems to be the case ;-) Independent of my personal preferences w.r.t. usability, habits etc., I think it is not good for a software/tool/framework if questions and discussions are spread over too many places. I guess

Re: serialization issue with mapPartitions

2014-12-25 Thread Tobias Pfeiffer
Hi, On Fri, Dec 26, 2014 at 10:13 AM, ey-chih chow eyc...@hotmail.com wrote: "I should rephrase my question as follows: How to use the corresponding Hadoop Configuration of a HadoopRDD in defining a function as an input parameter to the MapPartitions function?" Well, you could try to pull

fail to run spark PortfolioDemo with dse Cassandra

2014-12-25 Thread Zhang Jiaqiang
Hello All, I'm a newbie to Spark and Cassandra. I am trying to run the Spark demo within the dse-cassandra Portfoliodemo in a cluster environment but cannot get it to succeed. This issue may not really be coming from Spark, but I am not sure how to investigate it further. Please help me. There are 5 CentOS servers

RE: serialization issue with mapPartitions

2014-12-25 Thread Shao, Saisai
Hi, Hadoop Configuration is only Writable, not Java Serializable. You can use SerializableWritable (in Spark) to wrap the Configuration and make it serializable, then use a broadcast variable to broadcast this conf to all the nodes; after that you can use it in mapPartitions, rather than serialize it

RE: unable to do group by with 1st column

2014-12-25 Thread Somnath Pandeya
Hi, You can also try reduceByKey, something like this: JavaPairRDD<String, String> ones = lines.mapToPair(new PairFunction<String, String, String>() { @Override public Tuple2<String, String> call(String s)

Profiling a spark application.

2014-12-25 Thread rapelly kartheek
Hi, I want to find the time taken to replicate an RDD in a Spark cluster, along with the computation time on the replicated RDD. Can someone please suggest some ideas? Thank you

how to do incremental model updates using spark streaming and mllib

2014-12-25 Thread vishnu
Hi, Say I have created a clustering model using KMeans for 100 million transactions at time t1. I am using streaming, and say every hour I need to update my existing model. How do I do it? Should it include all the data every time, or can it be incrementally updated? If I can do an
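
As a rough illustration of what an incremental update could look like, here is a minimal plain-Python sketch of sequential (online) k-means: each new point moves its nearest center by a running-mean step, so old data is not revisited. This is one common textbook approach, not a statement of what MLlib does; the function names, the starting centers and the counts are all hypothetical.

```python
def nearest_center(point, centers):
    """Return the index of the center closest to point (squared Euclidean)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def update_model(centers, counts, batch):
    """Fold a new batch into the existing centers without revisiting old data."""
    for point in batch:
        i = nearest_center(point, centers)
        counts[i] += 1
        # Running-mean step: the center drifts toward the new point by 1/n.
        step = 1.0 / counts[i]
        centers[i] = [c + step * (p - c) for p, c in zip(point, centers[i])]
    return centers, counts


# Hypothetical model from time t1: two centers, each built from 4 points.
centers = [[0.0, 0.0], [10.0, 10.0]]
counts = [4, 4]

# Hypothetical hourly batch of new transactions.
centers, counts = update_model(centers, counts, [[1.0, 1.0], [9.0, 11.0]])
print(centers, counts)
```

Because the step is 1/n, older points never lose weight; a decay factor would be one way to let the model forget stale data, at the cost of another parameter to tune.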

Re: fail to run spark PortfolioDemo with dse Cassandra

2014-12-25 Thread Akhil Das
Can you cross-check your cassandra-rackdc.properties and cassandra-topology.properties files? It could be a misconfiguration. Also, it's better to look at the Cassandra logs to see what's happening internally. Thanks Best Regards On Fri, Dec 26, 2014 at 7:23 AM, Zhang Jiaqiang

Re: Discourse: A proposed alternative to the Spark User list

2014-12-25 Thread Josh Rosen
We have a mirror of the user and developer mailing lists on Nabble, but unfortunately this has led to significant usability issues because users may attempt to post messages through Nabble which silently fail to get posted to the actual Apache list and thus are never read by most subscribers: